
METHODS FOR OPERATING A LOG 
DEVICE 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention is generally related to high- 
performance computer filesystem designs used in conjunc- 
tion with contemporary operating systems and, in particular, 
to a multi-tasking computer system employing a log device 
to support a log structured filesystem paradigm over an 
independent filesystem and the operation of the log device to 
dynamically balance filesystem I/O transactions. 

2. Description of the Related Art 

In the operation of conventional computer systems, the 
overall performance of the system is often constrained by 
the practically achievable throughput rates of secondary 
mass storage units, typically implemented variously utiliz- 
ing single disk drives and cooperatively organized arrays of 
disk drives. As the peak performance of central processing 
units has dramatically increased, performance constraints 
have significantly increased due to the relatively lesser 
advances in performance achievable by secondary mass 
storage units. Factors that affect the performance of disk 
drive type devices include, in particular, the inherent 
mechanical operation and geometric relations imposed by 
the fundamental mechanical construction and operation of 
conventional disk drives. The essentially sequential operat- 
ing nature of disk drives and the extremely disparate rates of 
data read/write, actuator seek rates and rotational latencies 
result in the performance of secondary mass storage devices 
being highly dependent on the layout and logical organiza-
tion of data on the physical data storage surfaces of disk
drives. 

Due to inherent asymmetries in performing read and write 
disk drive data storage operations, particularly to ensure that 
physical storage space is correctly allocated and subse- 
quently referenced, a substantial tension exists between 
optimization of the data layout for data reads and writes. 
Typically, available physical storage space must be deter-
mined from file allocation tables, written to, and then
cataloged in directory entries to perform even basic data
writes. A sequence of physical data and directory reads are 
all that is typically required for data reads. 

Another factor that can significantly influence the opti- 
mum layout of data is the nature of the software applications 
executed by the central processing unit at any given time. 
Different optimizations can be effectively employed depend- 
ing on whether there is a preponderance of data reads as 
compared to data writes, whether large or small data block 
transfers are being performed, and whether physical disk 
accesses are highly random or substantially sequential. 
However, the mix of concurrently executing applications in 
most computer systems is difficult if not practically impos- 
sible to manage purely to enforce disk drive operation 
optimizations. Conventionally, the various trade-offs 
between different optimizations are statically established 
when defining the basic parameters of a filesystem layout. 
Although some filesystem parameters may be changeable 
without re-installing the filesystem, fundamental filesystem
control parameters are not changeable, and certainly not 
dynamically tunable during active filesystem operation. 

An early effort to improve the performance of secondary 
mass storage devices involved providing a buffer cache 
within the primary memory of the computer system. Con- 
ventional buffer caches are logically established in the file 
read/write data stream. Repeated file accesses and random 
accesses of a particular file close in time establish initial
image copies of the file contents within the buffer cache. The
subsequent references, either for reading or writing, are
executed directly against the buffer cache with any file write
accesses to the mass storage subsystem delayed subject to a
periodic flushing of write data from the buffer cache to
secondary mass storage. The buffer cache thus enables many
file read and write operations to complete at the speed of
main memory accesses while tending to average down the
peak access frequency of the physical secondary mass
storage devices.
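
The write-back behavior described above can be illustrated with a minimal sketch. It is not taken from any particular operating system; the class name `BufferCache`, its methods, and the use of a dictionary to stand in for secondary mass storage are all assumptions made for illustration. Repeated writes to one block complete in memory, and only the periodic flush touches the backing store.

```python
class BufferCache:
    """Toy write-back buffer cache in front of a slow block store (illustrative only)."""

    def __init__(self, backing_store):
        self.backing = backing_store   # dict: block number -> bytes, stands in for the disk
        self.cache = {}                # block number -> bytes held in main memory
        self.dirty = set()             # blocks written to the cache but not yet flushed
        self.disk_reads = 0
        self.disk_writes = 0

    def read(self, block):
        if block not in self.cache:    # miss: establish an image copy from the disk
            self.cache[block] = self.backing.get(block, b"")
            self.disk_reads += 1
        return self.cache[block]

    def write(self, block, data):
        self.cache[block] = data       # completes at main memory speed
        self.dirty.add(block)          # the disk write is delayed until flush()

    def flush(self):
        """Periodic flush: push every dirty block to the backing store."""
        for block in sorted(self.dirty):
            self.backing[block] = self.cache[block]
            self.disk_writes += 1
        self.dirty.clear()


disk = {7: b"old"}
cache = BufferCache(disk)
for i in range(100):
    cache.write(7, b"v%d" % i)         # 100 logical writes to the same block
first = cache.read(7)                  # served from the cache, no disk access
cache.flush()                          # a single physical write reaches the disk
```

In this sketch one hundred logical writes cost one physical write, which is the averaging effect noted above. Any write not yet flushed is lost if main memory is lost.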

A significant drawback of merely using a buffer cache to
improve overall system performance arises in circumstances
where data integrity requirements require write data to be
written to a non-volatile store before the write access can be
deemed complete. In many networked computer system
applications, particularly where connectionless communica-
tion protocols are utilized for file data transport over the
network, the requirement that file write data accesses be
completed to non-volatile store is a fundamental require-
ment of the network protocol itself. Thus, conventionally,
the file access latencies incurred in writing data to secondary
mass storage devices are a component of and compounded
by the latencies associated with data transport over both
local and wide area networks.

One approach to minimizing the performance impact of 
non-volatile storage write requirements has been to establish 
at least a portion of the buffer cache utilizing non-volatile 
RAM memory devices. Write data transferred to the buffer
cache intended for storage by the secondary mass storage
device is preferentially stored in the non-volatile RAM 
portion of the buffer cache. Once so stored, the file write data 
request can then be immediately confirmed as succeeding in 
writing the file data to a non-volatile store. 

There are a number of rather significant complexities in
utilizing non-volatile RAM buffer caches. The write and
read file data streams are typically separated so as to
optimize the use of the non-volatile RAM memory for
storing write data only. Also, substantial complexities exist
under failure conditions where write file data in the non-
volatile RAM cache must be cleared to secondary mass
storage without reliance on any other information or data
beyond what has been preserved in the non-volatile RAM.
Even with these complexities, which all must be compre-
hensively and concurrently handled, the use of a non-
volatile RAM store does succeed in again reducing file write
access latency to essentially that of non-volatile RAM
access speeds.

One particular and practical drawback to the use of
non-volatile RAM caches is the rather substantial increased
cost and necessarily concomitant limited size of the non-
volatile write cache. The establishment of a non-volatile
RAM cache either through the use of flash memory chips or
conventional static RAM memory subsystems supported
with a non-interruptible power supply is relatively expensive
as compared to the cost of ordinary dynamic RAM memory.
Furthermore, the additional power requirements and physi-
cal size of a non-volatile RAM memory unit may present
somewhat less significant but nonetheless practical con-
straints on the total size of the file write non-volatile RAM
cache. Consequently, circumstances may exist where the
non-volatile write cache, due to its limited size, saturates
with file write data requests resulting in degraded response
times that are potentially even slower than simply writing file
data directly to the secondary mass storage devices.
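
The saturation condition described above can be sketched as follows. This is a simplified model under stated assumptions, not a description of any actual non-volatile RAM product: the class `NonVolatileWriteCache`, its `capacity_blocks` parameter, and the dictionary standing in for the disk are hypothetical. A write is confirmed from the non-volatile cache while space remains, and falls back to a slow synchronous disk write once the cache is full.

```python
class NonVolatileWriteCache:
    """Toy model of a small non-volatile write cache (all names are illustrative)."""

    def __init__(self, capacity_blocks, disk):
        self.capacity = capacity_blocks
        self.nvram = {}                # block -> data; assumed to survive power loss
        self.disk = disk               # dict standing in for secondary mass storage

    def write(self, block, data):
        """Store the block durably and report where it landed: 'nvram' or 'disk'."""
        if block in self.nvram or len(self.nvram) < self.capacity:
            self.nvram[block] = data   # fast path: confirmed at NVRAM speed
            return "nvram"
        self.disk[block] = data        # cache saturated: synchronous disk write
        return "disk"

    def drain(self):
        """Background step: move cached writes to disk, freeing NVRAM space."""
        while self.nvram:
            block, data = self.nvram.popitem()
            self.disk[block] = data


disk = {}
nv = NonVolatileWriteCache(capacity_blocks=2, disk=disk)
outcomes = [nv.write(b, b"data") for b in (1, 2, 3)]   # third write finds the cache full
nv.drain()
```

The sketch omits the cost of draining, which in practice competes with new writes for the same disk bandwidth and is one reason a saturated cache can respond more slowly than direct disk writes.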

In order to alleviate some of the limitations of non- 
volatile RAM caches, disk caches have been proposed. The 
use of a disk drive for the non-volatile storage of write
cached data is significantly more cost effective and capable
of supporting substantially larger cache memory sizes than
can be realized through the use of non-volatile RAM memo-
ries. Although by definition a non-volatile store and capable
of being scaled to rather large capacities, disk caches again
have file write access times that are several if not many
orders of magnitude slower than conventional main memory
accesses. Consequently, disk caches are selectively con-
structed utilizing exceedingly high performance disk drives,
resulting in a typically modest improvement in file write
access times, but again at significantly increased cost.

In addition to the practical issues associated with using a
disk drive as a cache memory, logical data management
problems are also encountered. Preferably, the file read data
stream is logically routed around the disk cache and sup-
ported exclusively through the operation of the RAM buffer
cache. File write data bypasses the main memory buffer
cache and is written exclusively to the disk cache. Particu-
larly in multi-user and networked computer system
environments, multiple independent read and write file
accesses may be directed against a single file within a rather
small time frame. Since the requests are ultimately associ-
ated with potentially independent processes and
applications, the computer operating system or at least the
subsystem managing the buffer and disk caches must pro-
vide a mechanism for preserving data integrity. Write data
requests must be continually resolved against prior writes of
the same block of file data as stored by the disk cache. Each
read of a file data block must be evaluated against all of the
data blocks held by the write disk cache. While many
different bypass mechanisms and data integrity management
algorithms have been developed, the fundamental limitation
of a disk cache remains. Repeated accesses to the disk cache
are required not only in the ordinary transfer of write file
data to the cache but also in management of the cache
structure and in continually maintaining the current integrity
of the write file data stream. Consequently, the potential
performance improvements achievable by a disk cache are
further limited in practice.

Significant work has been done in developing new and
modified filesystems that tend to permit the optimal use of
the available mass storage subsystem read and write data
bandwidth. In connection with many conventional
filesystems, a substantial portion of the available data access
bandwidth of a disk drive based mass storage subsystem is
consumed in seeking operations between data directories
and potentially fragmented parts of data files. The actual
drive bandwidth available for writing new data to the mass
storage subsystem can be as low as five to ten percent of the
total drive bandwidth. Early approaches to improving write
data efficiency include pre-ordering or reordering of seek
and write operations to reduce the effective seek length
necessary to write a current portion of write stream data.
Further optimizations actually encourage the writing of data
anywhere on the disk drive recording surfaces consistent
with the current position of the write head and the avail-
ability of write data space. Directory entries are then sched-
uled for later update consistent with the minimum seek
algorithms of earlier filesystems.

In all of these optimized conventional filesystems, a
substantial portion of the disk drive bandwidth is still
consumed with seeking operations. Hybrid filesystems have
been proposed to further improve bandwidth utilization.
These hybrid filesystems typically include a sequential log
created as an integral part of the filesystem. The log file is
sequentially appended to with all writes of file data and
directory information. In writing data to the log structure,
only an initial seek is required, if at all, before a continuous
sequential transfer of data can be made. Data bandwidth, at
least during log write operations, is greatly improved.
Whenever the log fills or excess data bandwidth becomes
available, the log contents are parsed and transferred to the
filesystem proper.

A logging filesystem does not reduce, but likely increases
the total number of file seeks that must be performed by the
disk drive subsystem. The log itself must be managed to
invalidate logically overwritten data blocks and to merge
together the product of partial overwrites. In addition, file
data reads must continually evaluate the log itself to deter-
mine whether more current data resides in the log or the
main portion of the filesystem. Consequently, while atomic
or block file data writes may be performed quickly with a
minimum of seeking, there may actually be a decrease in the
overall data transfer bandwidth available from the disk drive
due to the new cleaning and increased maintenance opera-
tions inherently required to support logging. For these
reasons, hybrid logging filesystems, like the earlier disk
caches, have not been generally accepted as a cost effective
way of improving the overall performance of mass storage
subsystems.

A relatively new filesystem architecture, often referred to
as a log structured filesystem, has been proposed and generally
implemented as a mechanism for significantly improving the
effective data bandwidth available from a disk drive based
mass storage subsystem. A log structured filesystem pro-
vides for permanent recording of write file data in an
effectively continuous sequential log. Since data is inten-
tionally written as received continually to the end of the
active log, the effective write data bandwidth rises to
approximately that of the data bandwidth of the disk drive
mass storage subsystem. All seek operations are minimized
as file data is written to the end of the active log. Read data,
as well as cleaning and data block maintenance operations,
are the main source of seek operations.

Log structured filesystems are generally viewed as par-
ticularly cost effective in being able to use the entire drive
storage space provided by the mass storage subsystem for
the log structure and obviating any possible benefit to using
an external disk cache. Unlike the hybrid log filesystems, the
log structured filesystem is itself the ultimate destination for
all write file data. Since the effective write performance of
the resulting log structured filesystem is quite high, there is
no benefit for pre-caching write data on one disk drive and
then copying the data to another disk drive.

The general acceptance of log structured filesystems for
certain, typically write intensive, computer applications
reflects the significant improvement available through the
use of a direct sequential write log structure filesystem. The
available write data bandwidth, even in the presence of
continuing log cleaning and maintenance operations, can be
near or above 70 percent. In addition, log structured file-
systems provide a number of ancillary benefits involving
the reduced latency of atomic file data write operations and
improved data integrity verification following computer
system crashes. Particularly in network support related
operations, the direct writing of write file data to a log
structured filesystem, including directory related informa-
tion as an essentially atomic operation minimizes the total
latency seen in completing committed write network data
write transfer operations. Similarly, by virtue of all write file
data operations being focused at the end of the active log, as
opposed to being scattered throughout the disk drive storage
space, data verification operations need only be focused on
evaluating just the end of the log rather than the entire
directory and data structures of a conventional filesystem. 
Consequently, both network data write operations and the 
integrity of all data written is improved by writing directly 
to a permanent log structured filesystem. 
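
The append-only write pattern and the cleaning burden described above can be made concrete with a small sketch. It is a simplified in-memory model under stated assumptions, not the layout of any actual log structured filesystem; the class `AppendOnlyLog` and its method names are hypothetical. Every write goes to the end of the log, an index tracks the newest copy of each block, and cleaning discards logically overwritten records.

```python
class AppendOnlyLog:
    """Toy append-only log with an index of the newest record per block (illustrative)."""

    def __init__(self):
        self.records = []              # sequential log of (block, data) records
        self.latest = {}               # block -> index of its most current record

    def write(self, block, data):
        self.records.append((block, data))        # always at the end: no seek needed
        self.latest[block] = len(self.records) - 1

    def read(self, block):
        idx = self.latest.get(block)              # resolve the most current copy
        return None if idx is None else self.records[idx][1]

    def live_fraction(self):
        """Fraction of log records that have not been logically overwritten."""
        return len(self.latest) / len(self.records) if self.records else 1.0

    def clean(self):
        """Rewrite the log keeping only live records, in their original order."""
        live = [self.records[i] for i in sorted(self.latest.values())]
        self.records = live
        self.latest = {blk: i for i, (blk, _) in enumerate(live)}


log = AppendOnlyLog()
log.write("a", b"1")
log.write("b", b"2")
log.write("a", b"3")                   # logically overwrites the first record
before = log.live_fraction()           # two live records out of three
log.clean()
```

The write path never consults the index to find free space, which is why writes stay sequential. The read path and the cleaning step are where the extra work appears, consistent with the observation above that reads and maintenance are the main source of seek operations.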

Log structured filesystems are, however, not entirely 
effective in all computing environments. For example, log 
structured filesystems show little improvement over conven- 
tional filesystems where the computing environment is sub- 
ject to a large percentage of fragmentary data writes and 
sequential data reads such as may occur frequently in 
transactional data base applications. The write data optimi- 
zations provided by log structured filesystems can also be 
rather inefficient in a variety of other circumstances as well, 
such as where random and small data block read accesses are 
dominant. Indeed, as computer systems continue to grow in 
power and are required to support more and different appli- 
cation environments concurrently with respect to a common 
mass storage subsystem, the tension between applications 
for optimal use of the disk drive bandwidth provided by the 
mass storage system will only continue to increase. 

Therefore, a substantial need now exists for a new file- 
system architecture that is optimized, during the ongoing 
operation, for both read and write accesses concurrent with 
processes for ensuring data integrity and fast crash recovery, 
and the many practical issues involved in providing and 
managing a high performance filesystem. 

SUMMARY OF THE INVENTION 

Thus, a general purpose of the present invention is to 
provide for a new log device system architecture that 
maintains many of the advantages of conventional log 
structured filesystems while providing a much wider range 
of support for different and concurrently executed applica- 
tion environments. 

This is achieved through the use of a data storage sub- 
system that provides for the efficient storage and retrieval of 
data with respect to an operating system executing on a 
computer system coupled to the data storage system. The 
data storage system includes a storage device providing for 
the storage of predetermined file and system data, as pro-
vided by the computer system, within a main filesystem
layout established in the storage device. The data storage
system also includes a log device coupled in the logical data
transfer path between storage device and the computer
system. The log device [illegible] a control program
[illegible] in connection with the log device [illegible]
by a first free data segment and an oldest filled
data segment, to selectively clean the oldest filled data
segment to the first free data segment, and to selectively
transfer the predetermined file and system data from the log
device to the storage device. The control program utilizes
location data provided in the predetermined file and system
data to identify a destination storage location for the prede-
termined file and system data within the main filesystem.

An advantage of the present invention is that the operation
and management of the log device is independent of the
operation and management of the main filesystem. The inde-
pendent operation and management of the log device allows
the operation and management of the main filesystem to be
optimized independently of the log device to best serve
[illegible], typically for read accesses, such as by file and
directory clustering and various forms of data striping to
improve the integrity and logical survivability of the mass
storage device.

Another advantage of the present invention is that the log
device realizes the many practical advantages of log struc-
tured filesystems while avoiding the log maintenance related
data wrap around and compaction problems associated with
conventional log structured filesystems. In the present
invention, relocation of segment data on the log device is
asynchronous with log device writes. Compaction of data in
segments is performed synchronous with the write point
cleaning management operation. Since the log device of the
present invention is not subject to independent compaction
processes, links between sequential segments are not
required and the log structured filesystem layout of the
present invention is inherently more error resistant than
conventional log structured filesystems.

A further advantage of the present invention is that it may
be implemented as a pseudo-device driver that may be
transparently inserted within conventional operating system
layered structures at any of a number of logical connection
points. The different operating system insertion points can
be chosen between based on application program dependent
access strategies including the handling of page lists and I/O
buffers as passed between the pseudo-device driver of the
present invention and the operating system. As a result,
partial data segments may be selectively used as needed in
quickly writing data to the log device. Such partial data
segments will be automatically compacted as the segments
are cleaned by the asynchronous background cleaning pro-
cess of the present invention.

Yet another advantage of the present invention is that a
system of entry point overlay, or stealing, allows the log
device to be logically switched in and out of the data stream
between the operating system and the mass storage drives
based on a variety of parameters including the current write
controlling application environment. A system of translation
checking ensures that the most current read and write file
data within the scope of the log device is identified in
response to file data reads.

Still another advantage of the present invention is that the
operation and management of the log device allows for
transparent compression and other data transforms, such as
encryption and data hardening, to be selectively applied to
the log data as stored in the log structured filesystem of the
log device. In particular, high-speed compression of data to
be stored on the log device permits a larger effective main
filesystem space to be logged, thereby improving the effec-
tive read performance available from the main filesystem
while maintaining the improved write performance provided
by the log device.

Yet still another advantage of the present invention is that
main filesystem directory information is encoded into sys-
tem data that is stored with file data within the data segments
stored by the log device. This encoded information permits
high performance data location mapping algorithms to pro-
vide for translation and inverse translation of log device
stored block numbers while minimizing the use of the main
memory of a computer system for storing the translation
maps. A strategy of progressive writing of partial translation
maps and related data to the log device in conjunction with
or as part of data segments ensures that the log device is
robust and easily recoverable in the event of system crashes
and the like.
A still further advantage of the present invention is that
the log device itself may be physically structured as a
mirrored or RAID based disk drive subsystem operating
from the same or a different disk drive controller as the main
filesystem storage devices. The reliability of the log device
and recoverability of the data stored by the log device can be
readily scaled to match appropriately the speed and integrity
capabilities of the main filesystem storage device.

A yet still further advantage of the present invention is
that the independent operation of the log device permits
independent management operations to be performed on the
main filesystem storage device in a dynamic manner.
Through the operation of the log device, the fundamental
filesystem structure of the main filesystem storage device
can be changed between a mirrored and arrayed structure
and between different levels of RAID structure through
progressive management of alterations in the translation
maps relating logical and physical storage of file and system
data by the main filesystem storage device.

Still another advantage of the present invention is that the
log device may be implemented and operated without main
filesystem disks actually being used. Where data writes
[illegible] greatly predominate [illegible] has a relatively
[illegible] efficiently [illegible]
storage and then expired when no longer needed.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other advantages and features of the present
invention will become better understood upon consideration
of the following detailed description of the invention when
considered in connection with the accompanying drawings, in
which like reference numerals designate like parts through-
out the figures thereof, and wherein:

FIG. 1 is a basic block diagram of a computer system
implementing a log device in accordance with the present
invention;

FIG. 2 is a data flow diagram illustrating the effective
insertion of the log device pseudo-device driver into the
lower levels of the operating system to intercept and selec-
tively route file and system data with respect to the log
device disk and through the main disk device driver with
respect to the main filesystem storage device;

FIG. 3 provides a control and data flow diagram illustrat-
ing the control relationships between different component
parts of the log device pseudo-device driver in relationship
to the core operating system, [illegible]
and main storage device;

FIG. 4 is a logic flow diagram illustrating the translation
of data location identifiers with respect to the main storage
device into a location identifier appropriate for use in the
storage of segment data by the log device;

FIGS. 5a-h provide progressively exploded views of the
logical layout structure utilized in a preferred embodiment
of the present invention for the storage of data segments
within the log structured layout implemented within the log
device;

FIGS. 6a-c provide graphical representations and suben-
tity identifications for the super map, map cache, and
segment table utilized in the translation algorithms of a
preferred embodiment of the present invention; and

FIG. 7 provides a data flow diagram illustrating [illegible]

DETAILED DESCRIPTION OF THE INVENTION

A computer system 10 including a preferred embodiment
of the present invention is shown in FIG. 1. A host processor
or central processing unit (CPU) 12 connects through a
system bus 14 to a main memory 16 that is used to store
application programs and data as well as an operating
system used to supervise execution of the applications. The
operating system conventionally includes a filesystem
module [illegible] the UNIX® File System
[illegible] and device drivers suitable for establishing the
operative interface between the operating system and vari-
ous peripherals accessible by the CPU 12 through the system
bus 14. Optionally, a non-volatile RAM block (NVRAM) 18
can be provided as an adjunct to the main memory 16 for
storing certain relatively [illegible] data that may be advanta-
geously utilized during the operation of the computer system
10 and potentially during recovery from a system crash
condition.

In accordance with the preferred embodiment of the
present invention, a log disk controller 20 interconnects the
system bus 14 with a mirrored or RAID array of [illegible]
more disk drives 22 that provide the operative storage for
[illegible] disk data, logically referred to as the available log
device data [illegible]. A main disk controller 24 provides a
conventional interface between the system bus 14 and main disk
storage 26. In a preferred embodiment of the present
invention, the log disk controller 20 and main disk controller
24 may be physically the same disk controller supporting
two logically independent sets of disk drives 22, 26. Again,
the first set of disk drives comprise the log device disk drives
22 [illegible] set of drives is preferably organized as
[illegible] disk drive [illegible] main disk [illegible] 26.

In the preferred embodiment of the present invention, the
CPU 12 may be any of a variety of processor [illegible]
and chip sets ranging from, for example, an Intel® 80586, a
Motorola® PowerPC®, and a Sun MicroSystems® UltraS-
parc® processor. The operating system executed by the CPU
12 may be one of the many UNIX® or unix-like operating
systems that are freely or commercially available, Microsoft
Windows NT® Server, or another generally similar operating
system. The preferred operating system used in connection
with the present invention is the Unix-based Solaris® 2.0
operating system available from Sun MicroSystems®, Inc.

Unix and other operating systems that provide an
abstracted device independent interface to a mass storage
subsystem can readily be adapted to utilize the present
invention. Even where a device independent mass storage
subsystem interface is not normally provided as part of an
operating system, such as an embedded real-time operating
system, the operating system can be modified, consistent
with principles of the present invention, to utilize a log
device for improving the overall performance of any
attached secondary mass storage subsystem. Finally, suit-
able operating systems usable in connection with the present
invention will preferably provide support for multi-tasking
operations sufficient to allow one or more asynchronous user
and/or kernel level processes to provide for background
operating management of the log device 20, 22. The struc-
ture and operation of a generalized Unix operating system is
detailed in "The Design of the Unix Operating System," by
M. J. Bach, published by Prentice-Hall, Inc., 1986, which is
expressly incorporated herein by reference.

A data flow diagram 30, showing modified and generally
preferred data transport paths in a preferred embodiment of
the present invention, is presented in FIG. 2. An operating
system core 32 including a file system module, collectively
referred to as a Unix kernel, exchanges device independent
data streams with a main disk device driver 34 via data paths
36, 38. The main disk device driver 34 provides the neces-
sary and conventional support for transferring the device
independent data streams to the main filesystem disks 40 in
a manner consistent with the device dependencies necessary
to secure data storage and retrieval with respect to the main
filesystem disks 40.

In accordance with the preferred embodiment of the
present invention, a log device pseudo-device driver 44 is
provided in connection with the operating system core 32
and main disk device driver 34. Preferably, the data stream
exit and entry points of the operating system 32, otherwise
connected to the paths 36, 38, are instead routed via paths
42, 50 to the log device pseudo-device driver 44. The data
stream entry and exit points of the main disk device driver
are also coupled to the log device pseudo-device driver 44
via new data transfer paths 46, 48. Thus, at least data that is
to be written to or read from a selected filesystem nominally
maintained on the main filesystem disks 40 is routed through
the log device pseudo-device driver 44 and may be made
subject to the control operations established by the execution
of the log device pseudo-device driver 44. Specifically, the
log device pseudo-device driver 44 selectively provides for
the routing of filesystem [illegible]
to be at least temporarily stored and potentially read
back from the log device disks 52.

By the particular construction of a separate specialized
file data layout on the log device disks 52 that is indepen-
dently operated under control of the log device pseudo-

types of data that are stored by the log device is preferably
a dynamically alterable characteristic tuned in light of the
particular nature and use of the logged filesystem. Where
read and write data is generally small and random, all write
data, including meta-data, may be written to the log device
disk 52. If user data writes are typically large or determined
to exceed a programmable size threshold, the log device
pseudo-device driver 44 may direct such writes directly to
the logged filesystem. Small meta-data writes and perhaps
small user data writes might then be staged to the log device
disks pending a large write migration of such data to the
logged filesystem itself. In some situations, the predominant
component of the write data for a logged filesystem may be
a transaction log or the like. As an essentially write only data
object, the write data of a transaction log might alone be
directed to the log device disks 52 and later migrated to
archival storage or simply discarded as appropriate for the
use of the transaction log. Indeed, anytime where data reads
greatly predominate data writes, and the volume of new data
writes, as opposed to data overwrites, relative to the size of
the log device disks 52 is small, the log device disks 52 can
be used stand-alone. That is, while the operating system core
32 fully implements a filesystem for managing the storage of
data on the main filesystem disks 40, the complete data
storage function is performed by the log device disks 52
alone. Physical main filesystem disks 40 may not actually be
implemented. The existence and operation of the log device
pseudo-device driver 44 is transparent to the filesystem
paradigm implemented by the operating system core 32.
Consequently, the present invention allows a wide spectrum
of log device usage strategies to be constructed and imple-
mented in or through the log device pseudo-device driver 44

device driver 44, the present invention provides for a highly based on simple, easily distinguished characteristics of 

throughput optimized apparent filesystem write data transfer particular data being written to a logged filesystem. 

path from the operating system core 32 through the log 35 A cleaner daemon is periodically executed under the 

device pseudo-device driver 44 to the log device disk 52. control of the operating system core 32 to progressively 

This optimized write data path through the log device clean main filesystem data as stored on the log device disk 

pseudo-device driver 44 is effectively independent of any 52 and, subject to a number of constraint conditions, provide 

read data path 38 that may selectively remain for other for the migration of such filesystem data from the log device 

logical devices supported by the main disk device driver 34. 40 disk 52 through the main disk device driver 34 for final 

The read data path 48, 50 from the main file system disks 40 storage on the main filesystem disks 40. Thus, reads of 

through the log device pseudo-device driver 44 is only recently written data that can no longer be satisfied through 

slightly less independent of any active write data path to the the use of the buffer cache and that has not been cleaned to 

log device pseudo-device driver 44 due to any potential the main filesystem disks 40 are satisfied from the log device 

competition in execution of the log device pseudo-device 45 disks 52. 

driver 44 itself. The write data path 42 through the log Organization of the data stored by the log device disk 52 

device pseudo-device driver 44 to the log device disk 52 will in a physical layout independent of the filesystem format of 

be co-dependant on the concurrent use of the read data path the main filesystem disks provides a number of advantages 

from the log device disk 52 through the log device pseudo- over other data organizations including disk caches and 

device driver 44 to the operating system core 32 via the path 50 stream buffers. The establishment of an ordered data layout 

50. The significance of this co-dependance is mitigated by 0 n the log device disks 52 enables stored data to be readily 

the preferred utilization of a conventional buffer cache for recovered in the event of a system crash. All completed 

servicing repeated reads within the operating system core 32 atomic transfers of data to the log device disks 52 are fully 

itself and substantially avoiding the re-reading of data very recoverable from the log device disks 52 without necessary 

recently written to the log device disks 52. Thus, by the 55 reference to any control data as may, for example, be stored 

preferred operation of the present invention, relatively infre- transiently in main memory 16 or at significant cost, in the 

quently read data stored by a logged filesystem, i.e., a NVRAM memory 18, at the time of the crash. Each atomic 

filesystem provided 00 the main filesystem disks that is write transaction of user data to the log device disks 52 

subject to the logging operation of the log device pseudo- includes, by operation of the log device pseudo-device 

device driver 44, is predominately serviced from the main 6 0 driver 44, encoded system data, also referred to as meta-data 

filesystem disks 40. anc j distinguished from user data, that can be subsequently 

In general, all write data directed to a logged filesystem on decoded to determine the intended destination of the data 

the main filesystem disks 40 is written to the log device disks within a specific logged filesystem on the main filesystem 

52. A number of selective exceptions and conditions may be disk 40. As a consequence, the resident store of completed 

recognized by the log device pseudo-device driver 44 where 65 atomic transactions held by the filesystem on the log device 

logged filesystem write data may be nonetheless written disks 52 need only be progressively examined during a 

directly to the main filesystem disks 40. The desired type or system recovery operation to determine where valid data 
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found on the log device disks 52 is to be stored in a 
corresponding logged filesystem on the main filesystem 
disks 40. The progressive examination need only proceed 
from the last known good data checkpoint in the operation 
of the log device disks 52. The encoded location data enables 
the performance of data migration from the log device disk 
52 to the main filesystem disk 40 both in reconstruction of 
the main filesystem of the computer system 10 and simpli- 
fies execution of the background cleaning process. 
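
The recovery scan described above can be illustrated with a minimal C sketch. It does not reproduce the actual on-disk format: the `log_record` layout, its `seq` and `complete` fields, the tiny `BLOCK_SIZE`, and the `replay_log` function are all assumptions made for illustration. The sketch shows only the underlying idea, namely that each completed atomic write carries its own destination, so replay can begin at the last good checkpoint without any control data from main memory.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8   /* unrealistically small; illustration only */

/* Hypothetical log record: user data plus the encoded meta-data that
 * names its destination block in the logged filesystem. */
struct log_record {
    unsigned seq;          /* position in the log; increases with each write */
    unsigned dest_block;   /* destination block on the main filesystem disk */
    int      complete;     /* nonzero once the atomic write fully reached the log */
    char     data[BLOCK_SIZE];
};

/* Replay every completed record written after the last good checkpoint.
 * Returns the number of blocks copied to the main filesystem image. */
size_t replay_log(const struct log_record *records, size_t nrecords,
                  unsigned checkpoint_seq,
                  char main_disk[][BLOCK_SIZE], size_t main_blocks)
{
    size_t applied = 0;

    for (size_t i = 0; i < nrecords; i++) {
        const struct log_record *r = &records[i];

        if (r->seq <= checkpoint_seq)
            continue;               /* already covered by the checkpoint */
        if (!r->complete)
            continue;               /* torn write: not a completed transaction */
        if (r->dest_block >= main_blocks)
            continue;               /* decoded destination is out of range */

        memcpy(main_disk[r->dest_block], r->data, BLOCK_SIZE);
        applied++;
    }
    return applied;
}
```

A caller would pass the records read back from the log device together with the sequence number of the last checkpoint; records at or before the checkpoint and incomplete records are skipped.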

In addition, encoding and storing system location data 
with the user data as stored in data segments in the filesys- 
tem of the log device disks 52 permits the data segments to 
be manipulated and relocated within the filesystem on the 
log device disks 52. Both data relocation and block com- 
paction can be performed by the cleaner process on an 
ongoing basis independent of the writing of new data 
segments to the filesystem on the log device disks 52 and the 
subsequent migration of data to the logged filesystem on the 
main filesystem disks 40. Thus, the segment storage space 
represented by the log device disk 52 is continually cleaned 
and optimized to receive immediate writes of data segments 
from the operating system 32 and to selectively defer 
migration of data to the main filesystem disks 40 in opti- 
mization of filesystem data reads by the operating system 
core 32. 

A detailed control and data flow diagram 60 is provided 
in FIG. 3 illustrating the primary control and data relations 
between a preferred log-disk pseudo-device driver 61, as 
made up by modules 64-78, 84-88, the kernel mode oper- 
ating system core 62, the main disk device driver 34 and
various user and/or kernel mode daemons 90 executed as
background processes by the operating system core 32. The
kernel mode operating system core 62 represents at least the 
core portion of an operating system that presents an effec- 
tively device independent device driver interface that con- 
ventionally allows device dependencies, such as the low- 
level specifics of a physical filesystem layout, to be handled 
within an attached device driver. 

As reflected in FIG. 2, a main disk device driver 34 is
conventionally provided to support the main filesystem 
device dependencies. These device dependencies are rela- 
tively conventional and typically provide for the establish- 
ment of I/O control, raw, and buffered data stream interfaces 
to the device driver interface of the operating system core 
62. Typically, and preferably, the operating system core 62 
includes a conventional device independent Unix filesystem 
(UFS) or equivalent component that establishes the logical 
layout of a filesystem maintained on the main filesystem 
disks 40. The interface presented by the operating system 62 
to any relevant attached device drivers includes open, close, 
read, write and I/O control (IOCTL) interface points, as 
appropriate for both raw character data streams and, in 
particular, buffered block data streams. These interface 
points route through character and block device switch 
tables that serve to direct the potentially many different data 
streams supported by the operating system to one or more 
resident device drivers. Specifically, the data streams are 
associated with logical major and minor device numbers that 
are correlated through the switch tables to specify a corre- 
sponding device driver entry point to call for the transfer of a
particular stream of character or block data with respect to
the operating system core 62. The switch tables thus effec- 
tively operate to selectively map the device driver interface 
of the operating system core 62 to a specific corresponding 
interface of a device driver. The device switch tables are 
initially constructed in connection with the initialization of 
each of the device drivers associated with the operating
system core 62. The device driver initialization routines are
effectively discovered and executed by the operating system 
core 62 during global core operating system initialization. 

The device driver initialization routine provided effectively
within the IOCTL interface 64 of the log device
pseudo-device driver 61 is called subsequently to the execu- 
tion of the main disk device driver 34 initialization routine. 
With the device switch tables already initialized with entry 
points for the main disk device driver, the initialization 
routine of the pseudo-device driver selectively calls an I/O 
steal routine 68 that operates to remap selected device driver 
pointer entries within the device switch tables. A configu- 
ration data file stored within the main filesystem is read at 
the request of the I/O steal routine 68. The configuration file, 
preferably designated as /etc/dx.conf, supplies an identifi- 
cation and any applicable qualifications on the intercession 
of the log disk pseudo-device driver with respect to a 
specific potentially logged filesystem provided on the main 
filesystem disks 40. 

The I/O steal routine 68, when invoked during initializa- 
tion or subsequently to initiate the logging of a specific 
filesystem, copies down the entry points for the main disk 
device driver for each logged filesystem. Filesystems that 
are to be logged by default are identified in the /etc/dx.conf 
file. Corresponding entry point addresses are written into the 
switch tables to refer to the pseudo-device driver 61 in place 
of the main disk device driver 34. Consequently, the log 
device pseudo-device driver 61 is in fact called and used for 
the exchange of data with the operating system whenever an 
operating system core level reference is made to a logged 
filesystem as initially identified in the /etc/dx.conf file or 
subsequently identified to the log device pseudo-device
driver 61 through the use of a utility command.
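
The entry point remapping performed by the I/O steal routine can be sketched in C as follows. This is a simplified user-space model, not the driver's actual code: the one-field `bdevsw_entry` structure, the `DEV_MAJOR` encoding, and the functions `main_driver_init`, `io_steal`, and `dev_strategy` are assumptions chosen for illustration. The sketch copies down the main driver's entry point, writes the pseudo-driver's entry point into the switch table in its place, and lets the pseudo-driver forward requests to the saved entry.

```c
#include <stddef.h>

#define NMAJOR 4
#define DEV_MAJOR(dev) (((dev) >> 8) & 0xff)

/* Block-driver entry point: returns a status for a request on (dev, block). */
typedef int (*strategy_fn)(int dev, int block);

struct bdevsw_entry {
    strategy_fn d_strategy;
};

static struct bdevsw_entry bdevsw[NMAJOR];    /* block device switch table */
static strategy_fn saved_strategy[NMAJOR];    /* copied-down main driver entries */
static int log_intercepts;                    /* requests seen by the pseudo-driver */

/* Stand-in for the main disk device driver; echoes the block number. */
static int main_disk_strategy(int dev, int block)
{
    (void)dev;
    return block;
}

/* Pseudo-driver entry point: observes the request, then forwards it to
 * the saved main driver entry (a real driver would log the write here). */
static int log_strategy(int dev, int block)
{
    log_intercepts++;
    return saved_strategy[DEV_MAJOR(dev)](dev, block);
}

/* Main driver initialization installs its entry point first. */
void main_driver_init(int major)
{
    bdevsw[major].d_strategy = main_disk_strategy;
}

/* "Steal" the entry for one major number. Returns 0 on success, -1 if the
 * slot is out of range, empty, or already stolen. */
int io_steal(int major)
{
    if (major < 0 || major >= NMAJOR)
        return -1;
    if (bdevsw[major].d_strategy == NULL || saved_strategy[major] != NULL)
        return -1;
    saved_strategy[major] = bdevsw[major].d_strategy;
    bdevsw[major].d_strategy = log_strategy;
    return 0;
}

/* Kernel-side dispatch through the switch table. */
int dev_strategy(int dev, int block)
{
    return bdevsw[DEV_MAJOR(dev)].d_strategy(dev, block);
}
```

Because the operating system core only ever dispatches through the table, callers of `dev_strategy` are unaware of whether the entry has been stolen, which is the transparency the text relies on.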

The data path through the pseudo-device driver passes 
through a data interface 66. Depending on any filesystem 
specific options specified in the /etc/dx.conf file for a par- 
ticular logged filesystem, the data stream may be effectively 
bypassed through the pseudo-device driver 61 by transfer 
directly through the data interface 66 and I/O steal routine 68 
to the main disk device driver 34. The bypassed data is 
processed in a conventional manner by the main disk device 
driver and passed to a main disk controller 82, which is
logically, if not physically, distinct from the log disk con-
troller 80. In this manner, the conventional behavior of the
operating system and main disk device driver 34 is main- 
tained for data transfer operations where intercession of the 
log device is not desired or has been suspended. 
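
A minimal C sketch of such a routing decision follows. The `logged_fs_opts` structure, its field names, and the `route_write` function are hypothetical; the format of /etc/dx.conf is not given here, so the options are simply assumed to be available in memory. The size threshold corresponds to the programmable threshold mentioned earlier for large user data writes. The filtering of requests through the translation maps that still applies to a suspended filesystem is deliberately left out.

```c
#include <stddef.h>

enum route { ROUTE_LOG_DEVICE, ROUTE_MAIN_DISK };

/* Hypothetical per-filesystem options, e.g. as parsed from /etc/dx.conf. */
struct logged_fs_opts {
    int    bypass;            /* nonzero: intercession of the log device not desired */
    int    suspended;         /* nonzero: logging temporarily suspended */
    size_t direct_threshold;  /* writes larger than this skip the log; 0 = no limit */
};

/* Decide where a write of nbytes for this filesystem should be sent. */
enum route route_write(const struct logged_fs_opts *o, size_t nbytes)
{
    if (o->bypass || o->suspended)
        return ROUTE_MAIN_DISK;
    if (o->direct_threshold != 0 && nbytes > o->direct_threshold)
        return ROUTE_MAIN_DISK;
    return ROUTE_LOG_DEVICE;
}
```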

Where utilization of the log device is desired, the data
path extends from the operating system core 62 through the
data interface 66, a segment I/O routine 78, and the I/O steal
routine 68 to a main disk device driver. Where a physi-
cally separate log disk controller 80 is used, a separate or
other equivalent instance of the main disk device driver 34
is used by the log device pseudo-device driver 61.
Conversely, where the main and log disk controllers are
physically the same controller, a single instance of the main
disk device driver may be used. In either event, data modi-
fied by the log device pseudo-device driver in connection
with the execution of segment I/O routines 78 is directed
ultimately to or from the log device disks 52.

The internal pseudo-device driver data path is managed
by a number of control routines implemented within the log
device pseudo-device driver 61. The IOCTL interface 64
operates at least as the block data and driver configuration
control interface to the operating system core 62. While block
read and write entry points are separately provided by the
data interface 66, the IOCTL interface 64 provides the open,
close, IOCTL, and strategy entry points used by the operating
system core 62.

Log device specific IOCTL commands are utilized to
select and control the extended functionality provided by the log
device pseudo-device driver 61. One such IOCTL command
implements a directive that is passed to the data interface 66
to enable or disable any of a number of data transformation
operations including, in particular, symmetric data compression
and decompression on behalf of an argument specified
logged filesystem. Other data transforms such as encryption
and data hardening may also be implemented through the
operation of the data interface 66. Other directives supported
by the IOCTL interface 64 are used to signal the I/O steal
routine 68 to temporarily suspend and resume log device
support for a particular filesystem and to add anew or
completely withdraw intercession on behalf of an identified
main disk filesystem. Another directive may be provided to
the log device super block manager 70 to signal that super
block related information should be flushed to or restored
from the super blocks physically residing on the log disk.
Other directives can be issued to configure log maps used
internally by the log device pseudo-device driver for the
desired size and number of map entries that are needed in
operative support of a specific instance of the log device.
Other directives may signal, preferably issued periodically
during the ongoing operation of the log device, that entire
copies of the log maps be flushed to the log device disks 52
to comprehensively update or check-point the system data
held by the log device. Such periodic flushing of all relevant
meta-data to the log device disks 52 ensures that a quite
recent set of the log maps is always present on the log
device disks 52.

In a preferred embodiment of the present invention, the
log device pseudo-device driver distinguishes between
native IOCTL commands, issued by the operating system
core 62 for execution by the main filesystem device driver
34, and local IOCTL commands that are to be executed by
the log device pseudo-device driver 61. The native disk
driver IOCTL commands intercepted through the remapped,
or stolen, device switch table entry points are handled in
three distinct ways, depending on their intended function.
Native IOCTL commands that retrieve or set device specific
information, such as device state or geometry, are passed
through to the main filesystem device driver 34 for conventional
execution. IOCTL commands that conflict with the
logging of a main filesystem are rejected, typically with
ENXIO return codes. Lastly, IOCTL commands that perform
special read or write functions, such as to detect and
resolve the existence of inconsistent log device disk mirrors,
are processed through a log device data block location
translation algorithm and executed by the main filesystem
device driver 34 against the disks of the log device.

Local IOCTL commands generally correspond to and
invoke routines within the log device pseudo-device driver
61. Consequently, the user and/or kernel mode daemons 90
can equally utilize the operating system core 62 to pass
through the IOCTL commands.

An Open Log Device (LOG_OPEN) IOCTL command
opens the log device for the filesystem whose name is passed
in as an argument to the command. If no argument is
provided, the log device pseudo-device driver 61 attempts to
open the log device for the filesystems specified in a
/etc/dx.conf configuration file. Upon success, the log device
pseudo-device driver 61 returns the device number of the
specified log device. The command fails if the identified log
device is busy, the log is dirty, or if not enough memory can
be allocated for the translation and segment tables required
in operation by the log device pseudo-device driver. The
routine invoked by the Open Log Device IOCTL command
is also executed in response to Device Attach and Open
device driver calls.

The Open Log Device IOCTL command directs the log
device pseudo-device driver 61 to read in the appropriate log
device superblock. If the superblock [illegible], the
log device pseudo-device driver 61 returns an EIO error
[illegible]. [illegible] the log device pseudo-device driver 61
[illegible] kernel memory for [illegible] buffers, translation
and segment maps, and device tables. The log device
pseudo-device driver 61 then proceeds to read the corresponding
maps from the log device into the allocated kernel
memory. The device table is [illegible] against the current
system [illegible] if necessary. The log
device pseudo-device driver 61 then updates the time of the
[illegible] and posts the superblock. If an automatic
attachment mode is specified for the logged filesystem by a
command argument or parameter of the /etc/dx.conf file, the
log device pseudo-device driver 61 immediately attaches the
identified logged filesystem by stealing the corresponding
entry points of the main filesystem device driver.

Get and Set Device Parameters (LOG_GET_PARMS;
LOG_SET_PARMS) IOCTL commands retrieve or set,
respectively, log device pseudo-device driver 61 parameters
into, or from, a configuration parameter data structure within
the kernel memory space of the log device pseudo-device
driver 61. The only log device pseudo-device driver 61
parameters that cannot be altered in this fashion are the log
disk geometry, and the nature and location of the sector
markers.

An Extend Log (LOG_EXTEND) IOCTL command is
used to inform the log device pseudo-device driver 61 that
the log device has been extended, such as by the addition of
a new log device disk. The log device pseudo-device driver
61 assumes that the new portion of the physical log space
consists of free data segments, updates the segment map
accordingly, and posts the segment maps and the superblock.

An Attach Logged Device (LOG_DEV_ATTACH)
IOCTL command instructs the log device pseudo-device
driver 61 to immediately attach a particular logged filesystem
to the log device pseudo-device driver 61 by stealing the
corresponding main file system device driver entry points.

A Suspend Device Logging (LOG_DEV_SUSPEND)
IOCTL command instructs the log device pseudo-device
driver 61 to suspend the logging of write requests by a
specified log device or by all logged devices. However, a
suspended log device for a particular logged filesystem does
not become completely inactive, since all read and write
requests must still be filtered through the translation maps to
ensure that current data blocks are returned, in the case of
read requests, and matching translations are invalidated, in
the case of write requests.

A Resume Device Logging (LOG_DEV_RESUME)
IOCTL command instructs the log device pseudo-device
driver 61 to resume the logging of write requests for an
argument specified logged filesystem or for all logged
filesystems.

A Sync Logged Device (LOG_DEV_SYNC) IOCTL
command instructs the log device pseudo-device driver 61 to
set the state of a logged device to synchronized (sync'd).
This command does not actually perform any data movement
to sync logged data back to the main filesystem disks.
Rather, a separate user mode utility program is used to
suspend logging by a log disk relative to a specified main
filesystem and to flush corresponding currently logged data
to that main filesystem. The Sync Logged Device IOCTL
command is then issued by the user mode utility to obtain
confirmation of the completion of the sync operation or to
receive an EBUSY return code if any logged data remains in
the log device for the specified logged filesystem.

A Detach Logged Device (LOG_DEV_DETACH) 
IOCTL command instructs the log device pseudo-device 
driver 61 to detach a specified logged filesystem, or all 
logged filesystems. The log device must be successfully 
sync'd for the log device pseudo-device driver 61 to execute 
this command. 
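
The ordering constraints among the suspend, sync, and detach commands can be summarized in a small C state sketch. The `logged_dev` structure, the state names, and the three functions are hypothetical. The EBUSY result of a sync with logged data remaining follows the text; the EINVAL results and the EBUSY result of detaching an unsynchronized device are assumptions, since the text does not name those return codes.

```c
#include <errno.h>

enum logged_state {
    LOGGED_DETACHED,
    LOGGED_ATTACHED,
    LOGGED_SUSPENDED,
    LOGGED_SYNCED
};

/* Hypothetical per-filesystem bookkeeping kept by the pseudo-driver. */
struct logged_dev {
    enum logged_state state;
    unsigned live_blocks;   /* logged blocks not yet flushed to the main disks */
};

/* Suspend logging of new writes (assumed valid only while attached). */
int log_dev_suspend(struct logged_dev *d)
{
    if (d->state != LOGGED_ATTACHED)
        return EINVAL;
    d->state = LOGGED_SUSPENDED;
    return 0;
}

/* Sync moves no data: it only confirms that nothing remains in the log. */
int log_dev_sync(struct logged_dev *d)
{
    if (d->state == LOGGED_DETACHED)
        return EINVAL;
    if (d->live_blocks != 0)
        return EBUSY;
    d->state = LOGGED_SYNCED;
    return 0;
}

/* Detach is refused unless the device was successfully sync'd first. */
int log_dev_detach(struct logged_dev *d)
{
    if (d->state != LOGGED_SYNCED)
        return EBUSY;
    d->state = LOGGED_DETACHED;
    return 0;
}
```

In this model the user mode utility would call `log_dev_suspend`, flush the logged data by other means until `live_blocks` reaches zero, and only then obtain success from `log_dev_sync` and `log_dev_detach`.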

A Read Superblock (LOG_READ_SUPERB) IOCTL 
command returns the currently active superblock. A buffer of 
appropriate size must be provided to receive the superblock 
copy. 

A Flush Superblock (LOG_FLUSH_SUPERB) IOCTL 
command instructs the log device pseudo-device driver 61 to 
flush the currently active superblock to the log device disks. 

A Read Segment Table (LOG_READ_SEGTB) IOCTL
command copies a portion of the segment table into the 
address space of a user or kernel mode daemon 90. The 
daemon 90 must provide a buffer of appropriate size. This 
IOCTL is primarily used in connection with the periodic 
cleaning of data segments. 

A Flush Segment Table (LOG_FLUSH_SEGTB) IOCTL
command instructs the log device pseudo-device driver 61 to 
flush the segment table to the log device disks. 

A Read Translation Maps (LOG_READ_MAPS) 
IOCTL command copies a portion of the translation maps 
into the address space of a user or kernel mode daemon 90. 
The daemon 90 must provide a buffer of appropriate size. 

A Flush Translation Maps (LOG_FLUSH_MAPS) 
IOCTL command instructs the log device pseudo-device 
driver 61 to immediately flush all the translation maps to the 
log device and post the superblock. Since writes are blocked 
while the translation maps are being flushed, this command 
effectively suspends all activity by the log device until the 
flush is successfully completed. 

A Wait on Log Event (LOG_EV_WAIT) IOCTL com- 
mand suspends a calling process until one or more specified 
log events occur. Waiting on log device events is imple- 
mented by internal log device pseudo-device driver 61 
condition variables. The following events can be waited on: 

log device free space below or above the low or high free 
space water mark; 

log wraparound; 

device attach/suspend/resume/detach; 

log chunk crossing. 
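
How the free space events in the list above might be detected and matched against a waiter's request can be sketched in C. The event bit names, the crossing-based interpretation of the water marks, and both functions are assumptions; only the LOG_EV_WAIT command name comes from the text. The blocking itself, which the text attributes to internal condition variables, is not modeled.

```c
/* Hypothetical event bits; the text names only the LOG_EV_WAIT command. */
#define LOG_EV_FREE_LOW   0x1u  /* free space fell below the low water mark */
#define LOG_EV_FREE_HIGH  0x2u  /* free space rose above the high water mark */

/* Report which water mark events a change in free space triggers.
 * An event fires only when the mark is actually crossed. */
unsigned free_space_events(unsigned long prev_free, unsigned long cur_free,
                           unsigned long low_mark, unsigned long high_mark)
{
    unsigned ev = 0;

    if (prev_free >= low_mark && cur_free < low_mark)
        ev |= LOG_EV_FREE_LOW;
    if (prev_free <= high_mark && cur_free > high_mark)
        ev |= LOG_EV_FREE_HIGH;
    return ev;
}

/* A waiter is woken when any posted event is in its requested mask. */
int should_wake(unsigned wait_mask, unsigned posted)
{
    return (wait_mask & posted) != 0;
}
```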

[illegible]
address bypasses location address translation to read data
from a specified log block address on a log device. This 
command is used by a user or kernel mode daemon 90 to 
read segment summary blocks in connection with the log 
cleaning process. 

[illegible] bypasses location address translation to write data to a
specified log block address on a log device. This command 
provides three argument selectable functions: 

1) as a Write Conditional IOCTL, the command allows 
the log device pseudo-device driver 61 to decide 
whether to log the data, or write it to the specified 
address. This command is typically used in connection 
with the log cleaning process. The determination of
where to write the log block provided with the com-
mand is made by the log device pseudo-device driver 
61 depending on the then current I/O load and log 
device disk queue lengths. When the log block is to be 
written to the specified address, existing translations for 
any portion of the log block, if any, must be invalidated. 
Rewriting the data to the log as part of a new data 
segment is less expensive, since no additional I/O is 
needed to invalidate translations in this case. A data 
segment can be reclaimed when all the valid data 
blocks in the data segment have been conditionally 
rewritten to other existing data segments; 

2) as a Write Forced IOCTL, the command forces the log 
device pseudo-device driver 61 to write the log block 
provided with the command at the specified log block 
address location. This command is used in connection 
with the synching of the log device. Any existing 
translations for any overwritten portion of the log 
block, if any, must be invalidated; and 

3) as a Write Without Invalidation IOCTL, the command 
forces the log device pseudo-device driver 61 to write 
data at the specified log block location without invali- 
dating any existing translations. This command is used 
to prepare a log device for use as a self-contained log 
device, specifically in connection with writing log
device meta-data.
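
The three argument selectable functions can be contrasted in a short C sketch. The enumerator names, the `write_plan` structure, and the `plan_log_write` function are hypothetical. In particular, the rule that a conditional write is appended to the log when the disk queue is at or above a busy threshold is an assumption; the text says only that the decision depends on the current I/O load and log device disk queue lengths.

```c
enum log_write_mode {
    LOG_WR_CONDITIONAL,     /* driver decides where the block goes */
    LOG_WR_FORCED,          /* write at the given address, invalidate */
    LOG_WR_NO_INVALIDATE    /* write at the given address, keep translations */
};

enum log_write_target { TARGET_NEW_SEGMENT, TARGET_GIVEN_ADDRESS };

struct write_plan {
    enum log_write_target target;
    int invalidate;   /* nonzero: a separate step must invalidate existing translations */
};

/* Decide how a log block write is carried out for the selected mode. */
struct write_plan plan_log_write(enum log_write_mode mode,
                                 unsigned queue_len, unsigned busy_queue_len)
{
    struct write_plan p;

    switch (mode) {
    case LOG_WR_CONDITIONAL:
        if (queue_len >= busy_queue_len) {
            /* Busy: append to a new data segment; no extra invalidation I/O. */
            p.target = TARGET_NEW_SEGMENT;
            p.invalidate = 0;
        } else {
            p.target = TARGET_GIVEN_ADDRESS;
            p.invalidate = 1;
        }
        break;
    case LOG_WR_FORCED:
        p.target = TARGET_GIVEN_ADDRESS;
        p.invalidate = 1;
        break;
    case LOG_WR_NO_INVALIDATE:
    default:
        p.target = TARGET_GIVEN_ADDRESS;
        p.invalidate = 0;
        break;
    }
    return p;
}
```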

A Read Log Segment (LOG_READ_SEG) IOCTL
command reads an entire data segment into the address 
space of a user or kernel mode daemon 90. This command 
is used in connection with the log cleaning process to obtain 
log tail data segments. The daemon 90 must provide a buffer 
of appropriate size. 

A Free Log Segment (LOG_FREE_SEG) IOCTL com-
mand instructs the log device pseudo-device driver 61 to 
mark a command argument identified segment as being free. 
Execution of this command is conditional in that the log 
device pseudo-device driver 61 will independently verify 
that no log blocks in the identified data segment are still 
valid. If the segment map indicates any valid log blocks, the 
command is rejected with an EBUSY return code. This 
command is used in connection with the log cleaning 
process. 
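
The conditional nature of the Free Log Segment command can be sketched in C as below. The `seg_entry` structure and the `log_free_seg` function are hypothetical, and the EINVAL result for an out-of-range segment number is an assumption; the EBUSY rejection when valid log blocks remain follows the text.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical segment map entry. */
struct seg_entry {
    unsigned valid_blocks;  /* log blocks in the segment that are still valid */
    int      is_free;       /* nonzero once the segment is marked free */
};

/* Mark a segment free only after verifying that it holds no valid blocks. */
int log_free_seg(struct seg_entry *segmap, size_t nsegs, size_t seg)
{
    if (seg >= nsegs)
        return EINVAL;
    if (segmap[seg].valid_blocks != 0)
        return EBUSY;       /* the cleaner has not finished with this segment */
    segmap[seg].is_free = 1;
    return 0;
}
```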

A Read Log Statistics (LOG_READ_STATS) IOCTL
command copies the log device pseudo-device driver's
statistics counters into the address space of a user or kernel 
mode daemon 90. The daemon 90 must provide a buffer of 
appropriate size. 

A Turn Tracing On/Off (LOG_TRACE_ON;
LOG_TRACE_OFF) IOCTL command serves to turn tracing on
and off. When tracing is enabled, all I/O events captured 
through entry points stolen by the log device pseudo-device 
driver 61 are recorded and made available to user and kernel 
mode daemons 90 for analysis and generation of other 
IOCTL commands that may be used to adaptively and 
dynamically modify the operating parameters of the log 
device. 

Finally, a Read Trace Data (LOG_READ_TRACE) 
IOCTL command copies the contents of a trace buffer into 
the address space of a user or kernel mode daemon 90. The 
daemon 90 must provide a buffer of appropriate size. 

The log manager routine 74 interoperates with the log
map routine 72 and the segment I/O routines 78 to imple-
ment the physical log structure layout established on the log
device. The log map routines 72 manage a number of
translation maps ultimately utilized to establish the logical
correspondence between [illegible]
as stored on the log device, and the data
block storage locations within the filesystem layout structure segment I/O routine 78. The balance of the current data

established for the logged filesystem on the main filesystem segment may be filled with new data blocks written through 

disks 40. These log maps permit data blocks to be referenced the data interface 66 or as a result of cleaning the new log 

by the log device pseudo-device driver 61 based on logical tail data segment. Where data blocks are actively being 

references specific to the organization of the file system 5 directed through the data interface 66 for storage on the log 

established on the main filesystem disks. By maintaining the device, the compacted data blocks obtained from the prior 

appearance of a single location representation for all user log tail data segment may be mixed in order of receipt by the 

data that is passed through the log device pseudo-device segment I/O routine 78 into the current segment buffer 

driver 61, independent and even multiple filesystem orga- maintained by the segment I/O routines 78. Once a data 

nizations on the main filesystem disks 40 can be supported 10 segment is full or an IOCTL command is received to flush 

through the log device transparently with respect to the out the current segment, the log manager routines 74 direct 

kernel mode operating system core 62 itself. the writing out of the new data segment to the log device 

The log map routines 72 are called variously by the disks 52. 
IOCTL interface 64, log manager routines 74, and log cleaner
routines 76 to resolve location translations, update the various
log maps and to pass the log maps to the segment I/O routines 78
for storage as a part of one or more user data segments or full
meta-data segments written out to the log device disks 52. The
log map routines 72 are also called to recover part or all of the
log maps stored by the log device disks 52 upon initialization of
the log device pseudo-device driver 61 and progressively during
ongoing operation so as to support having only a portion of the
log maps physically resident within the kernel mode primary
memory 16, 18 as allocated to the log map routines 72.

The log manager routine 74 implements volume logging operations
that establish and maintain the particular layout of the log
structured filesystem on the log device disks 52. The reading and
writing of data blocks relative to the log device is initiated by
the I/O steal routine 68 by issuing read and write directives to
the log manager routines 74. Location translations are obtained
as needed by calls to the log map routines 72. The log manager
routines 74 then provide for the construction of a data segment
containing write data by the segment I/O routines 78, within a
segment buffer managed by the segment I/O routines 78, that, when
deemed complete, is effectively transferred by calls to the I/O
steal routines 68 to the main disk device driver 34 for writing
to the log device disks 52.

Similarly, a read directive received by the log manager routines
74 results in a location translation call to the log map routines
72 and a call to the segment I/O routines 78 to request at least
a portion of an identified segment to be read in from the log
device disks 52 through the main disk device driver 34 and into a
segment buffer maintained by the segment I/O routines 78. The
original read requested data block or blocks can then be read
from the segment and passed through the data interface 66 to the
kernel mode operating system core 62.

The log cleaner routines 76 provide for the periodic servicing of
the active segments stored on the log device disks 52. By
providing IOCTL commands through the IOCTL interface 64, the log
cleaner routines 76 direct the segment I/O routines 78 to read a
copy of the data segment at the current tail of the log on the
log device into a segment buffer. Each of the data blocks within
the data segment held by the segment I/O routine 78 is examined
against the log maps to determine whether the data block remains
used and valid. The data segment may contain data blocks that are
not used and represent nothing more than filler in a partial data
segment. Data blocks may also be invalidated for any number of
different reasons, resulting in a corresponding log map entry
that has been marked invalid or that now effectively points to a
superseding data block within a subsequently written data
segment. The log manager routines 74 are called upon by the log
cleaner routines 76 to compact the valid and used data blocks
within the segment held by the

Equivalently, a log tail data segment read from the log device
disks 52 may be compacted through the operation of the log
cleaner and log manager routines 76, 74 and added to any newly
written data blocks in the segment buffer maintained by the
segment I/O routines 78. In all events, the log manager routines
74 and segment I/O routines 78 receive each data block for
appending into the current data segment buffer. As each data
block is appended, the log manager routines 74 call the log map
routines 72 to make corresponding updates to the log maps. Once
the log manager routines 74 determine or are notified that the
segment buffer is effectively full of user data, the log manager
routine 74 directs the log map routines 72 to append a segment
trailer, including at least the relevant segment map entries, and
the segment I/O routines 78 to then write out the data segment to
the log device disks 52.

As each data segment is conclusively written to or read from the
log device disks 52, a log device superblock manager 70 is
called, typically through or by the log manager routine 74, to
perform a corresponding update of the superblock maintained on
the log device. In a preferred embodiment of the present
invention, the log device superblock manager 70 calls through the
I/O steal routines 68 to the main disk device driver 34 to read
and write superblocks maintained on the log device disks 52
independent of the segment I/O routines 78. By providing the log
device superblock manager 70 with an independent data path to the
log device disks 52 relative to the segment I/O routines 78, the
co-functioning of the log device superblock manager 70 and
segment I/O routines 78 is simplified. In addition, maintaining
superblock integrity is preferably treated as a high priority
responsibility by the log device pseudo-device driver 61. A flush
IOCTL command received by the IOCTL interface 64 is preferably
communicated directly to the log device superblock manager 70 to
initiate a superblock update.

The volume trace routines 84, monitor routines 86, and event
trace routines 88 are provided, as part of the log device
pseudo-device driver 61, to collect information on the state and
ongoing operation of the log device pseudo-device driver 61. The
volume trace routines 84 collect a variety of trace records that
serve to document data block reads and writes, as reported from
the I/O steal routines 68, data segment reads and writes, as
reported from the segment I/O routines 78, and configuration
management directives as received and handled by the IOCTL
interface 64.

The monitor routines 86 collect a variety of statistics
concerning principally the effective performance of data
transfers performed by the I/O steal routine 68. The monitored
statistics preferably allow for a direct analysis of the number,
size and type of data block transfers and data segment transfers
performed through calls to the I/O steal routines 68. The
collected information permits performance analysis on both a per
request basis and a unit time basis.
Particular statistics monitored include the rate of and relative
proportion of read and write requests received through the IOCTL
interface 64. Also monitored are the relative rates of reads and
writes of data segments to the log device disks 52, the frequency
that read requests are satisfied from the log device as opposed
to the main filesystem disks 40, and the effective rate of
compaction of data segments as log tail data segments are cleaned
and new data segments are written to the log head.

Finally, the event trace routines 88 collect trace records of
operational events that occur within the log device pseudo-device
driver. Preferably, maskable events signalled in response to
independent calls to the IOCTL interface 64, I/O steal routines
68, log manager routines 74, volume trace routines 84, the
monitor routines 86 and segment I/O routines 78 are recorded on
an on-going basis. Conditional variables within the various
signaling routines allow event signals to be separately enabled
and independently masked. Preferably, each of the different
routines reporting to the event trace routines 88 signal events
that are useful in operational debugging and exception analysis.

The trace records and operational statistics collected by the
volume trace routines 84, monitor routines 86, and event trace
routines 88 are preferably accessible by both user and kernel
mode daemons 90 executing under the control of the kernel mode
operating system core 62. The event traces and information
received by the daemons 90 are preferably used as the basis for
dynamic analysis and adjustment of certain aspects of the
on-going operation of the log device pseudo-device driver 61.
Dynamic reconfiguration or tuning of fundamental operational
parameters affecting the operation of the log device
pseudo-device driver 61 and the structure of data segments as
written out to the log device disks 52 is preferably directed by
one or more daemons 90 through operating system calls to the
kernel mode operating system core 62, resulting in the issuance
of IOCTL commands to the IOCTL interface 64. Thus, daemons 90 may
operate to implement any of a number of different log device
specific operational strategies and further dynamically vary
these strategies specifically in view of the analyzed performance
of the log device pseudo-device driver 61. Although the
fundamental parameters of a main filesystem layout may be
conventionally static, dynamic tuning of the log disk performance
allows an optimization of both the write data efficiency to the
log device and the read data efficiency from the main filesystem
disks. Consequently, as application load characteristics
concerning the size, type and frequency of data reads and writes
change, the configuration of the log device pseudo-device driver
61 can be continually adjusted to obtain and maintain optimum
performance.

The top level translation process flow 100 implemented as an
algorithm within the log map routines 72 is shown in FIG. 4. In
connection with a data block read or write request directed to
the IOCTL interface 64, the kernel mode operating system core 62
provides a logical data block number and a logical device number
that serve to uniquely identify the source or destination data
block for the request. The block number is preferably the literal
sequential block number within the logical storage space of a
particular disk drive that the given data block is to be written
to or read from. Typically, the block number is a double word or
64-bit wide unsigned integer. The device number is used by the
kernel mode operating system core 62 to select a specific disk
drive for the data block read or write. This allows the main disk
device driver 34 to support, through disk controllers 80, 82, any
number of physical disk drives supporting a number of logically
distinct filesystems.

The device number is used as an upper input element to the
translation routine 100, and the block number is used as a lower
input element. The two input elements are utilized together to
generate a data segment address. The data block specified by the
device number and block number is or will be stored in a log
block within a data segment at the data segment address.

The device number is initially translated through the use of a
device table 102 that serves to map kernel mode operating system
core provided device numbers to a set of small integers that the
log device pseudo-device driver 61 utilizes internally to
separately identify the logged filesystems. The device numbers
are limited to four bits each, which bounds the number of
filesystems that can be logged through a singular instance of the
log device pseudo-device driver 61. A copy of the contents of the
device table 102, including device name, device number and
internal device number, is preferably saved as part of the
superblock on the log device disks 52. A typically initialization
time execution of the log device superblock manager 70 performs
the initial superblock read and validation. The initialization
routine then restores from the superblock the contents of the
device table 102 for subsequent use by the log map routines 72.
In addition, an IOCTL command to initiate the logging of a new
filesystem on the main filesystem disks 40, or to detach a
previously logged filesystem, results in a corresponding update
of the device table 102 by the log map routines 72. The new
contents of the device table 102 are written out to the log
device with the next update of the log device superblock by the
log device superblock manager 70.

The internal device number and the block number are then provided
as inputs to a range compression algorithm 104 that places the
significant information provided by the internal device number
and block number into a single word wide 32-bit value. In
accordance with the present invention, the block number can be
scaled down, or right shifted, where the basic data block size
used by the log device is larger than that of the logged file
system on the main file system disks. Typical block sizes for the
main filesystem disks 40 are 512 bytes and 1 kilobyte (kbyte).
Log block sizes may range from 512 bytes to 8 kbytes or more.

The unused most significant bit positions of the block number can
also be trimmed, resulting in preferably a 24 bit wide value. The
precision of this value determines the number of log blocks that
can be addressed on a particular log device drive. Thus, for 8
kbyte log blocks, the 24 bit internal block number can uniquely
address 128 gigabytes of log disk storage. For 512 byte blocks,
the 24 bit internal block number can address eight gigabytes of
log disk storage. If a 26 bit internal block number is utilized,
between 2 and 32 kbytes of log block address space can be
referenced. Thus, even while reserving the two most significant
bits of the word wide output value for use as flags, the internal
block number and internal device number can be concatenated
together within a one word 32 bit value.

The internal device/block number produced by the range
compression algorithm 104 is provided to a translation table
algorithm 106. The lower half-word of the internal device/block
number is preferably utilized, due to its relatively or likely
more uniform distribution of values, as a hash address index into
a super map table 108. The hash index selected value read from
the super map table 108 is preferably combined with the upper
half-word of the internal device/block number to specify an
address offset into a map cache table 110.

In practice, particularly for determining whether the log device
stores a particular data block, the lower half-word (LKey) is
applied as a hash index value for selecting a corresponding entry
in the super map table 108. The value stored by the super map 108
corresponds to an effective y index into the map cache table 110.
The upper half-word, with the two most significant bits masked,
forms an upper key (UKey) that is utilized as a content
addressable value for selecting a particular x index entry in the
map cache table 110. The entry identified in the map cache 110,
when read, provides the upper key and a segment number
concatenated as a one word wide output value [UKey,SegNo] from
the translation table 106.
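The range compression and key split described above can be illustrated with a minimal Python sketch. The exact bit layout is an assumption made for illustration only: two flag bits at the top of the word, a four bit internal device number below them, and a 26 bit internal block number in the low bits. The function names `compress`, `split_keys` and `addressable_bytes` are hypothetical and do not appear in the specification.

```python
FLAG_BITS = 2     # two most significant bits reserved for flags
DEV_BITS = 4      # internal device numbers are limited to four bits
BLOCK_BITS = 26   # assumed: the remaining bits of the 32-bit word


def compress(internal_dev, block_no, shift=0):
    """Pack an internal device number and a block number into 32 bits.

    `shift` scales the block number down (right shift) when the log
    block size is larger than the logged filesystem block size.
    """
    if not 0 <= internal_dev < (1 << DEV_BITS):
        raise ValueError("internal device number out of range")
    scaled = block_no >> shift
    if scaled >= (1 << BLOCK_BITS):
        raise ValueError("block number does not fit after trimming")
    return (internal_dev << BLOCK_BITS) | scaled


def split_keys(word):
    """Split the packed word into (UKey, LKey) half-words."""
    lkey = word & 0xFFFF              # lower half-word: hash index
    ukey = (word >> 16) & 0x3FFF      # upper half-word, flag bits masked
    return ukey, lkey


def addressable_bytes(block_bits, log_block_size):
    """Log storage addressable with a block number of `block_bits` bits."""
    return (1 << block_bits) * log_block_size
```

Under these assumptions, `addressable_bytes(24, 8192)` evaluates to 128 gigabytes and `addressable_bytes(24, 512)` to eight gigabytes, which agrees with the figures given for a 24 bit internal block number.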

The combined upper key and segment number value is 
then used as a row index into a segment table 112. Each row 
of the segment table 112 potentially stores an external log 
device disk number and log block number [LogDevNo, 
LogBlkNo] sufficient to address any log block stored in any 
of the data segments on the log device disks 52. If a 
particular log block is identified as used and valid by status 
data also stored by the segment row, the segment containing 
the desired data block can be validly read to obtain the 
desired data block. 
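The three-stage lookup through the super map, map cache and segment table can be sketched as follows. This is an illustrative model only: Python dictionaries stand in for the hashed and content addressable tables, and the names `TranslationTables`, `SegmentRow`, `insert` and `lookup` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SegmentRow:
    log_dev_no: int      # external log device disk number
    log_blk_no: int      # log block number on that disk
    valid: bool = True   # status data: block is used and valid


class TranslationTables:
    def __init__(self):
        self.super_map = {}      # LKey -> map cache line (y index)
        self.map_cache = {}      # line -> {UKey: segment number}
        self.segment_table = {}  # (UKey, SegNo) -> SegmentRow

    def insert(self, ukey, lkey, seg_no, log_dev_no, log_blk_no):
        """Create entries as needed to reference a segment table row."""
        if lkey not in self.super_map:
            line = len(self.map_cache)
            self.super_map[lkey] = line
            self.map_cache[line] = {}
        self.map_cache[self.super_map[lkey]][ukey] = seg_no
        self.segment_table[(ukey, seg_no)] = SegmentRow(log_dev_no, log_blk_no)

    def lookup(self, ukey, lkey):
        """Return (LogDevNo, LogBlkNo), or None if the lookup fails."""
        line = self.super_map.get(lkey)
        if line is None:
            return None
        seg_no = self.map_cache[line].get(ukey)
        if seg_no is None:
            return None
        row = self.segment_table.get((ukey, seg_no))
        if row is None or not row.valid:
            return None
        return row.log_dev_no, row.log_blk_no
```

A `None` result corresponds to the case in which the request must be satisfied from the main filesystem disks rather than from the log device.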

When writing a new data segment to the log disk, the 
translation algorithm 100 is pro-actively executed creating 
new entries in the supermap table 108 and map cache 110 as 
needed to reference an empty entry in the segment table 112. 
The address of the first available log block within the first 
free data segment on the log disk and the corresponding log 
device number are then written into the newly allocated 
entry in the segment table 112. Eventually, the data segment 
is itself written to the log device disks 52. Subsequent 
references by the kernel mode operating system core 62 to 
a specific device number and block number [Ukey,Lkey] 
combination for a logged filesystem will then uniquely 
evaluate to this particular row entry in the segment table 112, 
at least until the data segment is relocated through cleaning 
or otherwise invalidated and the segment row updated. 

When a request is received to read a data block from a 
main filesystem whose entry points have been stolen, the log 
device pseudo-device driver 61 is required to determine 
whether the requested data block is currently stored on the 
log device disks 52. Where the corresponding logged
filesystem has at least not been detached, the translation
algorithm 100 is utilized to identify the potentially
corresponding log device drive and log block from the segment table
112. If the successively identified entries in the super map
and map cache are empty or invalid, the lookup fails and the
read request must be satisfied from the main filesystem disks 
40. If the lookup succeeds through to the segment table, the 
selected segment table entry is examined. Each used and 
valid segment table entry stores a device number and block 
number entry for each data block stored within the data 
segment. The data blocks within the data segment and the 
device/block numbers in the segment table entry are stored 
with an ordered correspondence. Other similarly ordered 
status bits stored by the segment table entry specify whether 
a corresponding data block is used and valid. Thus, a data 
block specifically referenced by the operating system core 
62 can be determined to be stored on the log device and, 
further, distinguished as being presently valid or invalid 
before the data segment need be read in by the log device 
pseudo-device driver. Since data segments are of known 
size, the log block number can be used, subject to a modulo 
data segment size calculation, to identify the data segment 
containing the addressed log block. 
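The segment identification step can be sketched in Python under two stated assumptions: log blocks are numbered consecutively from the start of the segment data block, and every data segment holds the same number of log blocks. The quotient of the log block number by the segment size selects the data segment, and the remainder gives the position of the log block within it. The function names are hypothetical.

```python
def locate_log_block(log_blk_no, log_blocks_per_segment):
    """Return (segment index, log block offset within that segment)."""
    if log_blocks_per_segment <= 0:
        raise ValueError("segment must hold at least one log block")
    return divmod(log_blk_no, log_blocks_per_segment)


def block_is_readable(status_bits, slot):
    """Check the ordered status bits of a segment table entry.

    `status_bits` is an assumed list of booleans, one per data block,
    in the same order as the data blocks within the data segment.
    """
    return 0 <= slot < len(status_bits) and status_bits[slot]
```

Because the check uses only the segment table entry, validity can be decided before any part of the data segment is read from the log device.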

Where the identified data block is marked valid, then at 
least the relevant log block of the data segment can be read 
in by operation of the segment I/O routine 78 and the 
referenced data block transferred through the data interface
66 to the operating system core 62. Alternatively, where the
data block is to be migrated or flushed out to the main
filesystem disks 40, the device and block numbers for the data
block are passed to and used by the main disk device driver
34 to specify the transfer destination of the data block on the
main filesystem disks 40.

Referring now to FIG. 5, an exploded view is shown of
the various data structures utilized on or in support of the log
device. FIG. 5a provides a graphical representation of the
logical storage space of a log device disk. Conventionally, an
initial small portion of the disk data space, designated H/W,
is dedicated to storing various hardware parameters that
qualify the size and type of storage space presented by the
specific implementation of the disk drive.

Following the hardware parameter block, a log header
block and a segment data block are provided. As shown in
FIG. 5b, the log header preferably contains two copies of the
log device disk superblock. The superblock structure holds
information defining:

the respective sizes of the entire log, a log block, a data
segment, and a data block, the logical position of this
disk in the log (nth of m disks), and a versioned serial
number identifier;

the ASCII device name and device major/minor number
of the log device;

the ASCII device names and device major/minor numbers
of the main filesystem devices that are being logged,
including information for each defining the relevant
state of the filesystem (logged, suspended) and
configuration data for the volume trace, monitor and event
trace routines;

the log segment number of the most recently written user
data segment and segment map segment;

the log segment numbers of the first and last free data
segments;

the free segment data space low and high water marks;
and

status flags used to identify whether the log is clean or
dirty and that a complete superblock image has been
written.

Two copies of the superblock are stored in the log header
block. The superblocks are alternately written to ensure that
at least one copy of the superblock maintained within the log
header block is valid at all times. The log device superblock
manager 70 is responsible for causing alternate copies of the
superblock to be written to the log header block, for providing
each superblock copy with a sequential, or versioned, serial
number when a complete instance of the superblock is
written out to the log header, and for marking each superblock
copy as valid and complete at the conclusion of the writing of
that copy.
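The alternating superblock update can be modeled with a short Python sketch. It is an in-memory illustration only: the class names, the `crash_before_commit` parameter used to simulate an interrupted write, and the rule of selecting the complete copy with the highest serial number are assumptions consistent with, but not stated verbatim in, the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuperblockCopy:
    serial: int = 0          # versioned serial number
    complete: bool = False   # set only after the full image is written
    payload: bytes = b""


class LogHeader:
    def __init__(self):
        self.copies = [SuperblockCopy(), SuperblockCopy()]

    def current(self) -> Optional[SuperblockCopy]:
        """Newest copy that was marked valid and complete, if any."""
        valid = [c for c in self.copies if c.complete]
        return max(valid, key=lambda c: c.serial) if valid else None

    def write(self, payload: bytes, crash_before_commit: bool = False):
        cur = self.current()
        serial = cur.serial + 1 if cur else 1
        # Always overwrite the slot that does NOT hold the newest copy.
        target = 1 if cur is self.copies[0] else 0
        self.copies[target] = SuperblockCopy(serial, False, payload)
        if crash_before_commit:
            return  # simulated failure: completion flag never set
        self.copies[target].complete = True
```

Because each write targets the older slot and the completion flag is set last, an interrupted write leaves the previous copy intact and selectable.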
The segment data block of FIG. 5a, generally representing
the remaining available data storage space of the log device
disk drive, is pre-configured to store data segments of
generally fixed length; the segment size can be selectively
changed in certain circumstances to effect adaptive
optimization of the operation of the log device. The data segments
are utilized variously to store copies of the supermap table
108 (supermap segments), the map cache table 110 (map
cache segments) and the segment table 112 (segment map
segments) in addition to user data (user data segments).
Within the segment data block, as shown in FIG. 5c, the data
segments are arrayed as a continuous sequence of data segments
distinguished with respect to the use of the log device disk as
including a physically first segment, a physically last segment,
and a current first free and last free segments, adjacent the
local log head and local log tail data segments, respectively,
within the segment data block. Ongoing identification of the
first free and last free segments of a local log device disk, and
thus of the log as a whole, is maintained by the log manager
routines 74. Thus, the log is operated effectively as a circular
data segment buffer spanning the segment data blocks of one or
more log device disks. New data segments are written to the log
head while log tail data segments are progressively cleaned by
the log cleaner routines 76. Meta-data segments are cleaned and,
if still valid, relocated to the first free segment log head.
User data segments are cleaned and relocated, at least in part,
to either the first free segment at the log head or migrated to
the main filesystem disks 40. The identifications of the first
and last free segments wrap around the log as new data segments
are written and old data segments are cleaned from the log as a
whole.

A more detailed view of a data segment is shown in FIG. 5d. The
data segment preferably consists of a segment body used for
storing either system data or user data and a trailer block that
is used to store system data relevant to the data stored by the
segment body of this and potentially other data segment bodies.
The structure of both the segment body and trailer blocks will
vary depending on the content type of the data segment.

User data is stored in a user data segment structure generally as
shown in FIG. 5e. The segment body portion of the data segment
includes a concatenated series of generally fixed size log
blocks, each containing a concatenated series of individual data
blocks. The log block size is at least equal to the size of a
data block, if not many times the size of a data block, and no
greater than the size of a data segment. The chosen ratios of log
block size to data block size and segment size to log block size
are performance optimization parameters.

In a preferred embodiment of the present invention, the log block
size may range from 512 bytes to a generally maximum size of
about 60 kilobytes. The block size is selected to optimize for
the intrinsic or natural characteristic of the filesystem being
logged. Thus, for a filesystem utilized by a data base management
system (DBMS), a 2 kilobyte block size may best match the typical
filesystem block write by the DBMS. The typical block write size
for network file system (NFS) services is 8 kbytes. Consequently,
an 8 kbyte block size will likely prove optimal for NFS mounted
filesystems.

In a standard operating configuration utilizing 512 byte data
blocks and 2 kbyte log blocks, 4 data blocks can be packed into a
log block. Where the default size of a data segment is 64 kbytes
in length, 30 log blocks are concatenated to form a segment body.
The remaining eight kilobyte space is allocated as a data segment
trailer block, also referred to as a segment summary for user
data segments. Where the data block size is 512 bytes and the log
block size is 8 kbytes, as for logging NFS mounted filesystems,
the default data segment of 64 kbytes stores seven log blocks to
form a segment body. The remaining eight kilobytes may again be
allocated as the data segment trailer block. However, typical
filesystems that support NFS usage either automatically encode,
or can be programmatically set to encode, distinctive headers or
identifiers within their write blocks of data. Where such native
identifiers include sufficient information to obviate the need
for a separate segment summary, a 64 kbyte data segment can be
used to store 8 full log blocks.

The segment summary, when used, includes a segment summary
header, a segment map block, a markers block, a relocations
block, and a segment summary trailer. The segment summary header
and trailer, not shown in FIG. 5e, are identical log disk data
structures that are intended to contain identical data. The
segment summary header and trailer structures store:

the current segment number;

a segment type identifier (user, segment, map cache or
super map);

the segment number of the prior segment of the same
type;

a segment serial number that is used to version written
data segments;

a segment time stamp that serves as a basis for
chronologically ordering written data segments;

the segment number of the last map segment; and

the segment number of the current last free segment in the
log.

By requiring data identity between the header and trailer of a
segment summary, the data content in the blocks between a header
and trailer pair can be uniquely determined to be valid.

The segment map block is used to store a segment map entry for
each of the data blocks within the segment body of the same user
data segment. Additional segment map entries may be included in
the segment map entry block. These additional entries pertain to
data blocks of other user data segments previously written to the
log device. Data block invalidations and other updates to the
segment map entry for a data block can thus be performed without
requiring a previously written data segment to be either updated
in place or updated and relocated to the log head. Updated
segment map entries, rewritten to the log device as part of a new
user data segment, effectively preserve the current state of the
segment map without interrupting the continuing stream of data
segments that are being written to the log device disks.

The markers block stores signature data used to validate that
each data block has been written completely to the log device.
Ordered marker entries are stored in the markers block preferably
in direct correspondence with the ordered data blocks in the user
data segment. In preferred embodiments of the present invention,
the markers entry for a data block may be internally generated or
simply provided by an external application that is the source of
the user data. Internally generated signatures are created from a
copy of the first several bytes of a data block. The bytes of the
markers entry are replaced in the data block itself with a
signature byte value that, if subsequently read and verified,
ensures that the entire data block was validly written. The
internally generated signature byte value may also include
additional information that specifies the intended order of the
data blocks within a log block. Thus, where a particular disk
drive may reorder a multiple data block write, the intended order
can be determined by comparison of the signature bytes to the
ordered entries in the markers block. With 512 byte data blocks
and four byte signatures, the markers block fits within a single
data block and is therefore assured of being internally correctly
ordered.

Application programs, particularly sophisticated database
management systems (DBMS), anticipate the potential for data
block re-ordering and may accordingly provide their own
signatures. Where a logged filesystem is specified to use
externally provided signatures, the presumed data block
signatures are simply copied to the markers block. Comparison of
the signatures from the data blocks with those of the markers
block still serves to ensure that the corresponding data blocks
have been validly written.
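The internally generated signature scheme can be illustrated with a minimal Python sketch. The signature format used here, a 16 bit segment serial number followed by a 16 bit ordinal, is purely an assumption for illustration; the description states only that the signature replaces the first several bytes of the data block and may encode the intended block order. The function names are hypothetical.

```python
SIG_LEN = 4  # four byte signatures, as in the 512 byte block example


def stamp(blocks, seg_serial):
    """Move the first SIG_LEN bytes of each block into a markers list
    and replace them in the block with a generated signature."""
    stamped, markers = [], []
    for order, blk in enumerate(blocks):
        markers.append(blk[:SIG_LEN])
        sig = (((seg_serial & 0xFFFF) << 16) | order).to_bytes(SIG_LEN, "big")
        stamped.append(sig + blk[SIG_LEN:])
    return stamped, markers


def verify_and_restore(stamped, markers, seg_serial):
    """Check every signature, recover the intended order, and put the
    original leading bytes back into each data block."""
    restored = [None] * len(markers)
    for blk in stamped:
        sig = int.from_bytes(blk[:SIG_LEN], "big")
        if (sig >> 16) != (seg_serial & 0xFFFF):
            raise ValueError("data block was not validly written")
        order = sig & 0xFFFF
        if order >= len(markers) or restored[order] is not None:
            raise ValueError("unexpected or duplicate block ordinal")
        restored[order] = markers[order] + blk[SIG_LEN:]
    if any(b is None for b in restored):
        raise ValueError("missing data block")
    return restored
```

In this sketch a reordered multiple block write is still restored to its intended order, since the ordinal travels with each block while the markers list stays in its original order.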
The relocations block provides storage for qualification
data specific to the respective entries in the segment map
block. These additional qualifications include identifying
whether individual segment map entries are valid and the
number of times that individual data blocks within a user
data segment body have been cleaned by operation of the log
cleaner routines 76. As with the markers block, each entry in
the relocations block is preferably four bytes or less, allowing
the entire block to fit within a single typical 512 byte data
block.

A general representation of a segment map segment is 
shown in FIG. 5f. The segment trailer block is constructed
in a manner substantially the same as the trailer block used 
in a user data segment. The trailer block of a segment map 
segment includes a header block (Hdr), a map summary (not 
shown), a segment signature (not shown) and a pointer block 
(Ptr) that is a second copy of the header block. The header 
and pointer blocks again contain: 

the current segment number; 

a segment type identifier (user, segment, map cache or 
super map); 

the segment number of the prior segment of the same 
type; 

a segment serial number that is used to version written 
data segments; 

a segment time stamp that serves as a basis for chrono- 
logically ordering written data segments; 

the segment number of the last map segment; and 

the segment number of the current last free segment in the 
log. 

Validity markers and relocation counts for each of the 
segment map entries provided in a segment map segment are 
stored in the map segment summary while segment signa- 
tures for each of the data blocks within the segment body are 
stored by the segment signature block. 

Significantly, the pointer block of each data segment contains the
segment number of the last previously written user data
segment and the last previously written segment map
segment. By utilizing a segment number reference, rather than
a relative pointer reference, both the system and user data
segments are not only internally self-describing, but define
a self-described and recoverable thread of like-type data
segments.
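The recoverable thread of like-type segments can be followed with a simple backward walk, sketched here in Python. The dictionary representation of segment headers and the field name `prior_same_type` are assumptions for illustration; a value of `None` is used to mark the oldest segment of a given type.

```python
def walk_thread(headers, start_seg_no):
    """Follow 'prior segment of the same type' references backwards.

    `headers` maps a segment number to a dict holding at least the
    hypothetical key 'prior_same_type'. Returns segment numbers from
    newest to oldest, stopping at a missing or repeated segment.
    """
    chain, seen = [], set()
    seg_no = start_seg_no
    while seg_no is not None and seg_no in headers and seg_no not in seen:
        seen.add(seg_no)
        chain.append(seg_no)
        seg_no = headers[seg_no]["prior_same_type"]
    return chain
```

Because the references are absolute segment numbers rather than relative pointers, the walk remains meaningful after intervening segments have been cleaned or relocated.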

As illustrated in FIG. 5f with respect to FIG. 5e,
individual segment map entries are present in the segment map
block of a user data segment at least to describe the user data 
log blocks that make up a majority of a user data segment. 
As also shown in relation to FIG. 5e, the segment body of 
segment map segments is dedicated to storing contiguous 
portions of the global segment map table. By flexibly 
allowing entire data segments to be dedicated to transferring 
copies of the segment map to the log device, in effect as part 
of the user data write data stream, the present invention 
permits a rapid saving of current global state data to the log 
device. 

The segment body of a segment map segment consists 
entirely of segment map entries. In practice, the total size of 
the global segment map is not only larger than the size of a 
single data segment, but may be several times the size of the 
portion of the segment map maintained in the table 112 
within the memory 16 at any one time. To accommodate 
fractional portions of the segment map table 112 being 
stored in individual segment map segments, the header 
block of a segment map segment also records an nth of m 
segment map segment identifier. Nominally, small sets of 
segment map entries are written out to the log device as part 
of user data segments. However, whenever the current
segment map table is flushed to the log device, such as in
anticipation of a system shut down or in periodic
maintenance of the log device, full segment map segments are
written out to the log device until the entire global segment
table is validly resident on the log device disks. In effect,
periodic flush operations result in the log device being
check-pointed at intervals, thereby improving and ensuring
the recoverability of the data held by the log device.
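A flush of the segment map into "nth of m" segment map segments can be sketched as follows. The list-of-dicts representation and the function name `split_segment_map` are hypothetical; the sketch shows only how a table larger than one data segment might be divided so that each piece records its position in the set.

```python
def split_segment_map(entries, entries_per_segment):
    """Divide segment map entries into nth-of-m segment bodies."""
    if entries_per_segment <= 0:
        raise ValueError("entries_per_segment must be positive")
    # Ceiling division: number of segment map segments required.
    m = (len(entries) + entries_per_segment - 1) // entries_per_segment
    segments = []
    for n in range(m):
        start = n * entries_per_segment
        segments.append({
            "nth": n + 1,   # 1-based position recorded in the header
            "of": m,
            "body": entries[start:start + entries_per_segment],
        })
    return segments
```

On recovery, a set can be treated as complete only when all m members carrying the same "of" value have been found.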

Each segment map entry, as individually shown in FIG. 5f,
includes a segment map number and a serial number
identification. Flag and status information defining the
current state of the corresponding data block, together with a
data block inverse translation, complete an individual entry.
The inverse translation directly preserves the original disk
address of a corresponding data block as stored within a
particular data segment. Since both the log device block
address of a data segment and the inverse translation for
each data block are preserved explicitly on the log device,
the data location relationships between data as stored on the
log device and within the main filesystem are fully
determinable by an ordered evaluation of the most recent set of
segment map segments on the log device and any user data
segments subsequently and validly written out to the log
device.
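
The ordered evaluation described above can be pictured as a replay of
segment map entries in serial number order, with later entries
superseding earlier ones. The following is a minimal sketch only. The
names `SegmentEntry` and `rebuild_locations`, and the simplification
of one validity flag per data block, are assumptions made for
illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

MainAddress = Tuple[int, int]  # (device number, block number) on the main filesystem


@dataclass
class SegmentEntry:
    segment_number: int         # which data segment on the log device
    serial: int                 # serial number; a higher value means written later
    inverse: List[MainAddress]  # original main-disk address of each data block
    valid: List[bool]           # status: whether each data block is still valid


def rebuild_locations(entries: Iterable[SegmentEntry]) -> Dict[MainAddress, int]:
    """Replay entries oldest-first; the newest valid copy of a block wins."""
    location: Dict[MainAddress, int] = {}
    for entry in sorted(entries, key=lambda e: e.serial):
        for address, ok in zip(entry.inverse, entry.valid):
            if ok:  # blocks marked invalid are skipped
                location[address] = entry.segment_number
    return location
```

The result maps each main filesystem address to the log data segment
that currently holds it, which is the relationship the inverse
translations are said to preserve.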

In a manner similar to the use of segment map segments,
cache map segments, as shown in FIG. 5g, and supermap
segments, as shown in FIG. 5h, are used in flushing the
contents of the map cache table 110 and supermap table 108
to the log device. Both map cache and supermap segments
employ segment trailers that have essentially the same
definition as the segment trailer used in the segment map
segment. Each cache map entry stored as part of the segment
body of a cache map segment includes a cache line identifier
(LI), a cache line size value (size), and some number of
individual map translation entries. As previously described,
each map translation entry stores a combination of an upper
key value (UKey) and a data segment number. Similarly, the
segment body of a supermap segment simply contains an
image copy of an ordered entry (LKey) from the supermap
table 108.
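
The cache map entry layout described above (LI, size, then a run of
UKey and segment number pairs) can be sketched as a simple
serialization. This is an illustration under assumptions. The 16-bit
little-endian field widths and the function names
`pack_cache_map_entry` and `unpack_cache_map_entry` are hypothetical
choices, not details stated for the embodiment.

```python
import struct
from typing import List, Tuple

Translation = Tuple[int, int]  # (UKey, data segment number)


def pack_cache_map_entry(line_id: int, translations: List[Translation]) -> bytes:
    """Serialize LI, size, then one (UKey, segment number) pair per translation."""
    header = struct.pack("<HH", line_id, len(translations))
    body = b"".join(struct.pack("<HH", ukey, seg) for ukey, seg in translations)
    return header + body


def unpack_cache_map_entry(data: bytes) -> Tuple[int, List[Translation]]:
    """Inverse of pack_cache_map_entry; the entry carries its own id and size."""
    line_id, size = struct.unpack_from("<HH", data, 0)
    translations = [struct.unpack_from("<HH", data, 4 + 4 * i) for i in range(size)]
    return line_id, translations
```

Because the identifier and size travel with the entry, an entry
packed this way can be read back without reference to any other
table.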

The cache map segments and supermap segments, like the
segment map segments, are periodically flushed from
memory 16 to the log device during the on-going operation
of the log device and, in particular, in immediate
anticipation of a system shutdown. Thus, whenever the log
device of the present invention is properly shut down, the
last few data segments at the log tail will be a sequence of
segment map, cache map and supermap segments. However, when
restarting the operational use of the log device, recovery
of the cache map and supermap segments is not essential. Due
to the completeness of the data contained within the segment
map segments, both the cache map and supermap segments can
be reconstructed independently from the segment map
segments. Thus, the segment map segments are preferably
flushed to the log device at a higher priority than any
other type of segment, including user data segments,
particularly where an unanticipated failure or operational
termination occurs.

The supermap table 108 is shown and described in greater
detail in relationship to FIG. 6a. The supermap table is used
as a first level address hash table used to store linear indexes
into the map cache table 110. The number of entries in the
supermap is dependent on the number of log block
translations required to be supported for the storage size of
the log device and correspondingly established log block and
segment sizes. A preferred log device, referred to as having a
reference configuration, utilizes a one gigabyte segment
body log device disk with 8 kbyte log blocks that are
partitioned into 64 kbyte data segments. A separate block
translation is required for each of the log blocks within the
log device. Thus, 131,072 block translations are required to
uniquely map all of the data that may be stored within the log
device. By establishing a map cache design parameter where
up to 15 segment references can be stored in each map
cache line, the complete supermap can be determined to
require a maximum of 8,739 entries. Accordingly, three
supermap segments are required to store the entire supermap
table on the reference log device.

The map cache is representatively shown in FIG. 6b. For
the reference log device each map cache line is 16 words
wide. The first half-word of a cache line contains the linear
index number of a particular map cache line. The second
half-word stores a binary value specifying the number of
map cache entry translations that are validly present in the
remaining 15 words of the cache line. An entry length of
zero specifies that the corresponding map cache line is not
presently used and any entries present are not valid. Cache
map translations are formed from a half-word upper key
(UKey) derived from the combined device number and
block number as initially presented to the translation
algorithm 100. The next half-word is allocated to specify the
segment number corresponding to the translation. Thus,
each translation entry occupies a single word and between 1
and 15 translations may be stored in a single cache map line.

Preferably, the cache map table 110 is managed utilizing
a pool memory allocation scheme so as to minimize unused
memory within the map cache 110 as stored in memory 16.
The cache map entry pool is initialized with a mix of unused
cache map entries for storing 1, 3, 5, 7 and 15 cache map
translations. As translations are added to the cache map,
minimum length cache map lines sufficient to hold an initial
translation are used. Subsequent translations that index
through the supermap to a common or previously populated
map cache line result in the translation being stored in an
available translation entry location or cause a copy down
of the cache map line to an unused cache map line providing
twice as many translation entry locations. Both the new and
pre-existing translations are copied to the new map cache
line, the entry length of the cache line is updated, and the
linear index number is stored at the beginning of the line.
The prior line is then cleared. By choice of the size of the
supermap and the number of available linear index numbers,
the potential number of lower keys that can hash to a common
linear index can be effectively limited to 15 translations.

The map cache is utilized both in support of data segment
reads and writes. Where a data segment is to be written, the
lower key hash through the supermap determines a linear
index into the map cache. A sequential or, alternately, binary
tree search through the map cache determines whether a
corresponding linear index identified map cache line is
presently in use. If not, a cache line is allocated and used.
Where a corresponding map cache line has already been
allocated, a short series of management determinations are
made. First, the upper key of the write translation is used as a
content addressable selector against the upper keys of the
translations present in the map cache line. Where an upper
key match is found, the new translation is written into the
corresponding translation within the map cache line,
effectively invalidating the prior translation. Where an upper
key match is not found, a new cache map line is allocated if
and as needed to accommodate storing the new translation
entry in the cache line. In both events, the entry length value
is updated to correctly identify the number of valid
translation entries existing in the corresponding map cache
line. Translations present in the map cache that are initially
allocated and empty or are subsequently invalidated through
deletion or migration of the underlying segment data,

By design, the map cache lines described above are fully
self-identifying. No relative pointers are utilized or values
that are hardware configuration specific. As a consequence,
map cache lines and, indeed, the entire map cache itself can
be flushed to the log device disks without any additional
processing and subsequently restored with equal speed.

Segment table entries, as stored in the segment map 112
[illegible], are preferably two words in length. Within the
first word of the entry, a segment number and serial number
are stored in an identification field (ID). A flags field may be
used to store [illegible]; a status field identifies the validity
of the log blocks [data blocks?] that make up the
corresponding data segment. An inverse translation is
provided in the second word of the segment entry. This single
word stores an internal representation of the original device
number and block number provided to the translation
algorithm 100 with a data block. In the reference
configuration, 112 translations can be stored in each segment
map entry.

Again, as with both the supermap and map cache
structures, the segment table entries are entirely
self-identifying. Consequently, individual segment map entries
can be separately or collectively saved out from and read
into the in-memory segment map table 112 without further
processing of the entries. Consequently, the storage and
retrieval of segment table entries can be performed quite
efficiently. This efficiency directly supports having only a
subset of the full segment map in memory at any one time.
Least recently used in-memory segment entries can be
flushed to the log device to make room for newly referenced
segment entries to be brought into the segment map table
112.

The data and control flow diagram 120 illustrated in FIG.
7 generally describes the ongoing operation of a log device
consistent with the [illegible] shown in [illegible]. The
illustrated log device is constructed utilizing three physical
disk drives 122, 124, 126. The data segments that comprise
the [illegible] the segment data blocks of the three disk
drives 122, 124, 126 may be initially unused, eventually
actively utilized data segments may, as shown, be present on
each of the disk drives [illegible]. Utilizing more than a
single disk drive to store the log allows the first free data
segment to be located on one drive 122 while the last free
data segment is located on another drive 126. Consequently,
read and write operations to the head and tail of the logical
log may occur in a largely unconstrained manner independent
of one another.

[illegible] 122, 124, 126 [illegible] of the drives 122, 124,
126 therefore identifies its own first free data segment and
own last free data segment. The ordered relationship between
the individual log drives 122, 124, 126 and the first and last
free data segment identifiers for the log as a whole is
preferably maintained internal to the log manager routines
74. This information is preferably initialized from the
superblocks held by each of the log disks 122, 124, 126 by
reference to fields within the superblock that serve to
describe each log disk as the nth of m log disk for an
identified filesystem on the main filesystem disks.

When first initialized, both the head and tail of the active
log will exist on the same log disk drive. As data segments
are received ultimately from the host computer system
through the log device pseudo-device driver, separately
illustrated as the write log stream 128, each data segment is
stored at the then current first free segment within the log
device. Since the physical data layout of the log structured
device is entirely hidden from the host, data received from
the host is formed into corresponding data segments within
the write log stream 128.

The configuration of the log device is, preferably,
programmable through IOCTL commands provided through the
log device pseudo-device driver, including specifically the
write log stream 128. Dynamic configuration and
re-configuration through programmable adaptive controls
applied to the write log stream 128 on an initialization of the
log device pseudo-device driver 61 and, as desired, during
the on-going operation of the log device is preferably
provided by the user/kernel mode daemons 90. For example,
a current default log block size of 8 kbytes may be
dynamically reduced to 4 kbytes in order to optimize the
operation of the log device to smaller data block writes by the
host computer. Other parameters that can be adaptively
controlled include the total size of a log segment and the
overall size of the log device utilized at one time. Since the
write log stream 128, in effect, contains the tables underlying
the translation algorithm used to store and retrieve data
segments from the log device, changes to the fundamental
structure of the physical data layout may be performed
dynamically without dependence on existing aspects of
either the host computer system or the potentially various
filesystem structures present on the main filesystem disks.

The adaptive controls permit fundamental aspects of
filesystems provided on the main filesystem disks to also be
dynamically modified. For example, where entire data
stripes of a RAID-3 configured filesystem can be held
entirely by the log device, the data stripe may be
progressively written back to the main filesystem disks in the
form of a RAID-5 stripe. This is accomplished by altering the
translation algorithm 100 to account for the difference in
RAID filesystem layout organization. In effect, many
fundamental aspects of a filesystem provided on the main
filesystem disks that were previously static can now be
dynamically modified through a progressive logging and
rewriting of the data to or between filesystems on the main
filesystem disks.

Inasmuch as adaptive changes can be applied to the
physical layout of the log disks to optimize operation for
writing, the present invention thus efficiently permits the
filesystem layout on the main filesystem disks to be
dynamically altered particularly to optimize operation for
data reads.

In the preferred operation of the log disk pseudo-device
driver, data segments are populated on a disk 0 122 until
approximately 70 percent of the data segments are actively
used. The remaining 30 percent of free disk segments are
maintained open preferably to receive a flush of system data
segments should an immediate need for a flush operation
occur.

Once a log disk has reached the filled segment threshold,
the head of the logical log wraps to the next log disk in
sequence. Thus, the log structured device operates as a
logically continuous circular buffer for data segments. The
log tail is continually cleaned on a generally periodic basis
by the background operation of the log cleaner routines 132.

Since the log cleaner routines 132 are responsible for
cleaning log blocks from the last used data segment to
construct new data segments to be written to the current first
free segment of the log, the log cleaner routines 132 are also
responsive to log device layout changes directed by the
adaptive controls. The log cleaner routines 132 thus provide
for the reading of segments from the last used data segment
in one physical layout form while writing out a cleaned data
segment to the current first free data segment location
utilizing a new desired physical layout.

With each invocation, the log cleaner routines 132
examine the log blocks of the last used data segment. Based
on the segment map entries and markers in the trailer of the
data segment, particularly as compared against the current
state of the map cache as held by the map cache table 110,
the log cleaner routines determine whether log blocks within
the last used data segment have been invalidated as a
consequence of the underlying data within a log block being
overwritten or deleted. Individual data blocks within the log
blocks may be marked as invalid and entire invalid log blocks
may be dropped from the data segment. The segment map,
markers and relocations blocks within the segment trailer are
correspondingly modified.

As part of the data segment cleaning, the relocations
information within the user data segment trailer is examined
to determine whether any particular log block has been
relocated through cleaning in excess of a threshold number
of relocations; the threshold number may be set to an
adaptive control defined value. Individual log blocks that
have been relocated more than the current relocation value
are not incorporated into the new cleaned data segment.
Rather, the data blocks within the log block are provided
with their corresponding inverse translations to the main
disk device driver 34 for writing out to the main filesystem
disks 40. In a preferred embodiment of the present
invention, the default threshold relocation value is set at
three. This value can be modified dynamically to a lower
value should the segment data storage space provided by the
log device tend to fill at too high a rate. Alternately, the
relocation parameter value may be increased to slow the rate
of migration of data blocks from the log device to the main
filesystem disks. This permits actively or repeatedly written
data to be better isolated on the log device for a longer period
of time to minimize migration writes to the main filesystem
disks and preserve greater bandwidth for main filesystem
disk read operations.

Finally, the log device pseudo-device driver 61 routines
responsible for managing the read log stream 130 are
responsible for initially determining whether a host
requested data block is validly present on the log device.
Again, this determination is made through the execution of
the translation algorithm 100 to determine whether a log
block contains the requested data block. Where there is no
matching translation entry, the read log stream routines 130
pass the read request on to the main disk device driver.

However, where a valid match is found, the read log
stream 130 is responsible for reading in the corresponding
data segment. Since read operations may be random, the
requested data segment may lie on any of the log disks 122,
124, 126 within the log device, as illustrated. By utilizing
multiple log disks, the chances that the read request must be
satisfied from the same log disk that includes the first free
segment of the log head are reduced. The operation of the
buffer cache within the primary memory 16 further serves,
in operation, to reduce the occurrence of disk read requests
for a given data block close in time to write data requests for
the same data block by the host computer system.
Consequently, a substantial majority of read data requests
actually satisfied from the log device through the read log
stream routines 130 will occur on log disks 124, 126
separate from the log disk 122 that maintains the then
current first free data segment of the log. As a result, body
and tail log disks 124, 126 can be almost exclusively used
for data segment read operations. Conversely, the log disk
122 with the current first free segment of the log is
substantially shielded by the operation of the buffer cache
within the primary memory 16 and, therefore, performs
essentially only write data segment operations.
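
The cleaning policy described above, in which invalidated blocks are
dropped and blocks relocated more than a threshold number of times
are migrated to the main filesystem disks, can be sketched as
follows. This is a simplified illustration under assumptions. The
names `LogBlock` and `clean_segment` are hypothetical, and only the
default threshold of three is taken from the description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LogBlock:
    main_address: Tuple[int, int]  # inverse translation: (device, block) on main disk
    relocations: int               # times this block has been moved by cleaning
    valid: bool                    # False if overwritten or deleted since logging


def clean_segment(blocks: List[LogBlock], threshold: int = 3):
    """Split the last used segment into blocks to re-log and blocks to migrate."""
    relog: List[LogBlock] = []
    migrate: List[LogBlock] = []
    for block in blocks:
        if not block.valid:
            continue  # invalidated blocks are dropped from the cleaned segment
        if block.relocations > threshold:
            migrate.append(block)  # to be written out to the main filesystem disks
        else:
            # kept on the log; this cleaning pass counts as one more relocation
            relog.append(LogBlock(block.main_address, block.relocations + 1, True))
    return relog, migrate
```

In this sketch, lowering `threshold` moves data off the log sooner,
while raising it keeps repeatedly written data on the log for longer,
mirroring the adaptive adjustment described for the relocation
parameter.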

Thus, a log device system providing for an improved
utilization of filesystem read and write operational
bandwidth has been described. The log device, appearing to the
kernel mode operating system core as the original device
driver entry points of the main disk device driver,
transparently hides both the implementation and operation of
a log device subsystem independently implemented on an array
of log disks. The resulting effectively composite filesystem
established by the combined operation of the log device and
the native filesystems supported by the main disk device
driver not only allows optimization of read and write data
operations, but further allows initially established data
layout parameters to be dynamically adjusted to maintain
optimal independent response to read and write data requests
as the relative nature of such requests changes with different
application program loads and mixes as executed by the host
computer system.

Naturally, many modifications and variations of the
present invention are possible in light of the above
description of the preferred embodiments. However, the
modifications and variations that will be readily apparent to
those of skill in the art may be practiced without departing
from the scope and nature of the present invention as set forth
in the appended claims.

We claim: 

1. A method of storing and retrieving data by a computer
system executing an operating system and supporting first
and second persistent storage devices, said operating system
including a filesystem module coupled through a first device
driver to said first persistent storage device to transfer
filesystem data blocks, said filesystem module providing
support for synchronous write transactions, said method
comprising the steps of:

a) providing a second device driver selectively coupled
between said filesystem module and said first device
driver;

b) collecting a predetermined set of data blocks provided
from said filesystem module as part of a synchronous
write transaction into a data segment, storing said data
segment on said second persistent storage device, and
signaling completion of said synchronous write
transaction to said filesystem module, wherein said step of
collecting further provides for

1) constructing a map relating said predetermined set of
data blocks to said data segment, whereby said
second device driver can identify said data segment
by reference to any of said predetermined set of data
blocks by said filesystem module; and

2) storing said map, including progressively updated
versions of said map, on said second persistent
storage device, whereby said map can be
reconstructed from said second persistent storage device;

c) migrating said predetermined set of data blocks from
said second persistent storage device to said first
persistent storage device by said second device driver
through use of said first device driver independent of
said filesystem module, wherein said data segment
includes a predetermined set of address references for
said predetermined set of data blocks and wherein said
step of migrating said predetermined set of data blocks
to said first persistent storage device is performed by
said second device driver dependent on said
predetermined set of address references;

d) retrieving any of said predetermined set of data blocks
from said first persistent storage device;

e) selectively bypass writing said predetermined set of
data blocks, as provided from said filesystem module,
through said first device driver to said first persistent
storage device; and

f) selectively bypass reading said data segment from said
second persistent storage device to transfer any of said
predetermined set of data blocks to said filesystem
module.

2. The method of claim 1 further comprising a step of 
evaluating said map when said filesystem module provides 
an address reference to determine if said address reference 
is in said predetermined set of address references. 

3. The method of claim 2 further comprising a step of 
providing a write data cache within a memory space of said 
computer system to minimize use of said step of selectively 
bypass reading. 

4. A method of storing and retrieving data within a 
computer system that is coupleable to a plurality of disk 
drives, where said computer system executes an operating 
system, including a filesystem and a main filesystem device 
driver and where said filesystem defines a data layout for the 
storage of data blocks within the addressable storage space 
of a main disk drive, said method comprising the steps of: 

a) including a log device driver transparently coupleable 
between said filesystem and said main filesystem 
device driver and coupleable to a log disk drive; 

b) selectively transferring data blocks between said 
filesystem, said main disk drive and said log disk drive, 
wherein data blocks, identified by filesystem address 
and provided by said filesystem, are preferentially 
written to said log disk drive and wherein data blocks 
identified by filesystem address are preferentially read 
from said main disk drive for transfer to said filesys- 
tem; 

c) determining, by said log device driver, whether to 
transfer a predetermined data block, having a predeter- 
mined filesystem address, to said main disk drive or 
said log disk drive based on a data block transfer load 
comparison between said log device disk drive and said 
main disk drive, whereby said log device driver seeks 
to balance the data block transfer load of said log 
device and main disk drives. 

5. The method of claim 4 further comprising the steps of:

collecting a plurality of data blocks determined to be
written to said log disk drive into a log data segment;
and

establishing a mapping between the filesystem addresses 
of said plurality of data blocks and a segment address 
of said log data segment as storable by said log disk 
drive. 

6. The method of claim 5 further comprising the step of 
writing predetermined portions of said mapping to said log 
disk drive. 

7. The method of claim 6 further comprising the step of 
writing, to said log disk drive, the main filesystem addresses 
of said plurality of data blocks in connection with the writing 
of said log data segment to said log disk drive. 

8. The method of claim 7 further comprising the step of 
relocating said log data segment within the addressable 
storage space of said log disk drive while maintaining the 
connection between said predetermined plurality of data 
blocks and their corresponding main filesystem addresses.

9. The method of claim 8 wherein said step of relocating
is performed at predetermined intervals and wherein said
predetermined plurality of data blocks are written to said
main disk drive after being relocated on said log disk drive
a predetermined number of times.

10. The method of claim 4 wherein said log disk drive
provides for storage of a predetermined plurality of log data
segments and wherein said method further includes the steps
of:

a) transferring said predetermined data block to said log
disk drive as part of a first log data segment at a first
segment address within said log disk drive;

b) reading said first log data segment from said log disk
drive;

c) selectively transferring said predetermined data block
to said log disk drive as part of a second log data
segment at a second segment address within said log
disk drive; and

d) selectively transferring said predetermined data block
to said main disk drive.

11. The method of claim 10 wherein said predetermined
filesystem address of said predetermined log data block is
maintained in connection with said predetermined log data
block as stored by said log disk drive and wherein said step
of selectively transferring said predetermined data block to
said main disk drive utilizes said predetermined filesystem
address in transferring said predetermined data block to said
main disk drive.

12. The method of claim 11 wherein said predetermined
filesystem address is stored on said log disk drive in
correspondence with said predetermined log data block.

13. The method of claim 12 wherein the correspondence
of said predetermined filesystem address with said
predetermined log data block is maintained independent of
whether said predetermined log data block is part of said first
or second log data segment.

14. The method of claim 13 further comprising the step of
determining whether said predetermined log data block has
been read and transferred back to said log disk drive a
threshold number of times whereupon said predetermined
log data block may be selectively transferred back to said log
disk drive or selectively transferred to said main disk drive.

15. The method of claim 14 further comprising the step of
maintaining a translation map for reference to determine
whether said predetermined data block is stored as part of
said first log data segment or as part of said second log data
segment, said translation map being updated with each write
of said predetermined data block to said log disk drive and
to said main disk drive.

16. The method of claim 15 further comprising the step of
writing at intervals at least portions of said translation map
to said log disk drive.

17. A method of storing and retrieving data within a
computer system that is coupleable to a plurality of disk
drives, where said computer system executes an operating
system, including a filesystem and a main filesystem device
driver to provide a first addressable storage space, having a
first data storage organization defined by said filesystem,
within a main disk drive to allow for the storage of
filesystem data blocks, said method comprising the steps of:

a) providing a log device driver coupled between and
effectively transparent to said filesystem and said main
filesystem device driver, said log device driver
providing a second addressable storage space within a log
device disk drive and having a second data storage
organization defined by said log device driver, said
second addressable storage space providing for the
storage of a predetermined data log segment that
includes a predetermined filesystem data block;

b) enabling the selective transfer of said predetermined
filesystem data block from said filesystem to said log
device disk drive, and from said main disk drive and
said log device disk drive to said filesystem; and

c) selecting to transfer said predetermined filesystem data
block between said filesystem, said log device disk
drive and said main disk drive to preferentially
isolate, from the apparent perspective of said
filesystem, read transfers of said predetermined data
block to said main disk drive and write transfers of said
predetermined data block to said log device disk drive,
thereby enabling said operating system to transparently
access and selectively transfer filesystem data blocks
with respect to said second addressable storage space in
place of said first addressable storage space.

18. The method of claim 17 wherein said step of enabling
includes storing predetermined system data with said
predetermined data block within said second addressable
storage space, said predetermined system data establishing a
relation between a log device storage location of said
predetermined data block within said second addressable
storage space and a main storage location of said
predetermined data block within said first addressable storage
space as defined by said filesystem.